100+ datasets found
  1. structured-wikipedia

    • huggingface.co
    Updated Sep 16, 2024
    Cite
    Wikimedia (2024). structured-wikipedia [Dataset]. https://huggingface.co/datasets/wikimedia/structured-wikipedia
    Dataset updated
    Sep 16, 2024
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Wikimedia Structured Wikipedia

      Dataset Description

      Dataset Summary
    Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
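Since each language edition ships as JSONL compressed in a zip (one JSON object per line, one full article per object), the archive can be streamed line by line. Below is a minimal, self-contained Python sketch of that pattern; the member name `enwiki.jsonl` and the `name`/`url` fields are illustrative assumptions rather than the dataset's actual schema, and the zip is built in memory so the example runs without the download.

```python
import io
import json
import zipfile

# Hypothetical one-line sample mimicking the dataset's JSONL layout:
# one JSON object per line, one full article per object.
sample = json.dumps({"name": "Example article",
                     "url": "https://en.wikipedia.org/wiki/Example"})

# Build a tiny in-memory zip so the sketch is self-contained;
# with the real download you would open the .zip file from disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("enwiki.jsonl", sample + "\n")

# Stream the JSONL member line by line instead of loading it whole.
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("enwiki.jsonl") as fh:
        articles = [json.loads(line) for line in fh if line.strip()]

print(articles[0]["name"])
```

With the real archive, passing its path to `zipfile.ZipFile` and iterating members the same way avoids decompressing the whole dump at once.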

  2. Structural Metadata from ArXiv Articles

    • ebiquity.umbc.edu
    Updated Sep 1, 2017
    Cite
    Muhammad Rahman (2017). Structural Metadata from ArXiv Articles [Dataset]. https://ebiquity.umbc.edu/resource/html/id/374/Structural-Metadata-from-ArXiv-Articles
    Explore at:
    zip compressed json object (566 megabytes, compressed)
    Dataset updated
    Sep 1, 2017
    Authors
    Muhammad Rahman
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    1991 - 2016
    Description

    The dataset contains metadata encoded in JSON and extracted from more than one million arXiv articles that were put online before the end of 2016. The metadata includes the arXiv id, category names, title, author names, abstract, link to article, publication date and table of contents.

  3. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • zenodo.org
    • data.niaid.nih.gov
    bin, png, txt
    Updated Jul 11, 2024
    Cite
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2024). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. http://doi.org/10.5281/zenodo.8075918
    Explore at:
    png, bin, txt
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb).
    It is being made public both to serve as supplementary data for the paper "Towards High-Value Datasets determination for data-driven development: a systematic literature review" (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.


    The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what these studies have found to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    ***Methodology***

    To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the results to papers in which these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.

    ***Test procedure***
    Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study.
    The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx).
    The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    ***Description of the data in this data set***

    Protocol_HVD_SLR provides the structure of the protocol.
    Spreadsheet #1 provides the filled protocol for the relevant studies.
    Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.

    The information on each selected study was collected in four categories:
    (1) descriptive information,
    (2) approach- and research design- related information,
    (3) quality-related information,
    (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter}
    5) DOI / Website- a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of an article in the Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim, established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
    14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation why these data are not shared?
    15) Period under investigation - period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

    Quality- and relevance- related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)


    ***Format of the file***
    .xls, .csv (for the first spreadsheet only), .odt, .docx

    ***Licenses or restrictions***
    CC-BY

    For more info, see README.txt

  4. AJOL dataset: structured metadata of articles and journals indexed in African Journals Online

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 10, 2025
    Cite
    Alonso-Álvarez, Patricia (2025). AJOL dataset: structured metadata of articles and journals indexed in African Journals Online [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14899379
    Dataset updated
    Mar 10, 2025
    Dataset authored and provided by
    Alonso-Álvarez, Patricia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of African Journals Online publications and journals (last update: February 2024). The dataset contains metadata for articles and journals indexed in African Journals Online (AJOL). It provides the information contained in AJOL in a structured format that can be downloaded and used easily. It also contains a unique identifier matching AJOL articles with their OpenAlex records in order to facilitate the use, comparison, and combination of both data sources.

    Details about the download, methods, and findings are reported in the following preprint:

    Alonso-Álvarez, P. (2025). A small step towards the epistemic decentralization of science: a dataset of journals and publications indexed in African Journals Online. Zenodo. 10.5281/zenodo.14900054

    Detailed information on the database construction process is reported in the following file:

    ajol_database_report.pdf

    Data files:

    ajol_journals.csv: contains metadata from journals, including title, eISSN, ISSN print, country, JPPS category, and open access status (binary for diamond journals).

    ajol_journals_area.csv: relates journals to their AJOL research area categories. Journals can belong to up to three categories.

    ajol_pub.csv: contains articles’ metadata, including journal identifiers, article URL, doi, issue, volume, date, year, title, first page, and last page.

    ajol_pub_author.csv: relates articles to their authors.

    ajol_pub_keyword.csv: includes article keywords.

    ajol_pub_openalex.csv: relates AJOL articles to their OpenAlex records using the unique identifiers of each data source.

    readme.csv: contains the description of the variables in all data files.

    ajol_database_report.pdf: detailed information on the database construction process.
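Because the files are linked by shared identifiers (ajol_pub_openalex.csv maps AJOL articles to their OpenAlex records), combining them is a plain key lookup. The Python sketch below is a stand-in only: the column names `ajol_id` and `openalex_id` are assumptions, and inline CSV strings replace the real files; readme.csv documents the actual variable names.

```python
import csv
import io

# Illustrative stand-ins for ajol_pub.csv and ajol_pub_openalex.csv;
# the real column names are described in readme.csv.
pubs_csv = "ajol_id,title\n1,Sample study\n"
links_csv = "ajol_id,openalex_id\n1,W123456789\n"

# Build a lookup from the AJOL identifier to the OpenAlex record id.
links = {row["ajol_id"]: row["openalex_id"]
         for row in csv.DictReader(io.StringIO(links_csv))}

# Attach the OpenAlex id to each article via the shared key.
articles = []
for row in csv.DictReader(io.StringIO(pubs_csv)):
    row["openalex_id"] = links.get(row["ajol_id"])
    articles.append(row)

print(articles[0]["openalex_id"])
```

The same join works with the real files by replacing the inline strings with `open("ajol_pub.csv")` and `open("ajol_pub_openalex.csv")`.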

  5. Medium articles dataset

    • crawlfeeds.com
    json, zip
    Updated Jul 23, 2025
    Cite
    Crawl Feeds (2025). Medium articles dataset [Dataset]. https://crawlfeeds.com/datasets/medium-articles-dataset
    Explore at:
    json, zip
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Buy Medium Articles Dataset – 500K+ Published Articles in JSON Format

    Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.

    Use Cases:

    • Training language models (LLMs)

    • Analyzing content trends and engagement

    • Sentiment and text classification

    • SEO research and author profiling

    • Academic or commercial research

    Why Choose This Dataset?

    • High-volume, cleanly structured JSON

    • Ideal for developers, researchers, and data scientists

    • Easy integration with Python, R, SQL, and other data pipelines

    • Affordable and ready-to-use

  6. Wikipedia-Articles

    • huggingface.co
    Cite
    Bright Data, Wikipedia-Articles [Dataset]. https://huggingface.co/datasets/BrightData/Wikipedia-Articles
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Bright Data
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "BrightData/Wikipedia-Articles"

      Dataset Summary
    

    Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.

  7. Development of the number of categorized articles and of authors.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Iassen Halatchliyski; Ulrike Cress (2023). Development of the number of categorized articles and of authors. [Dataset]. http://doi.org/10.1371/journal.pone.0111958.t001
    Explore at:
    xls
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Iassen Halatchliyski; Ulrike Cress
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Development of the number of categorized articles and of authors.

  8. Data from: Dataset for the EPSL article: Structure and dynamics of the Tonga subduction zone: new insight from P-wave anisotropic tomography

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 7, 2022
    Cite
    Zhiteng Yu (2022). Dataset for the EPSL article: Structure and dynamics of the Tonga subduction zone: new insight from P-wave anisotropic tomography [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7076276
    Dataset updated
    Oct 7, 2022
    Dataset provided by
    Dapeng Zhao
    Zhiteng Yu
    Jiabiao Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The obtained 3-D P-wave anisotropic and isotropic velocity models in the Tonga subduction zone.

    Please find this article: Z. Yu*, D. Zhao and J. Li*. Structure and dynamics of the Tonga subduction zone: New insight from P-wave anisotropic tomography. Earth and Planetary Science Letters, https://doi.org/10.1016/j.epsl.2022.117844

  9. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    Updated Apr 4, 2020
    Cite
    Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oleh Onyshchak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    • This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
    • A superset based on good articles is also hosted on Kaggle; it has six times more entries, although of somewhat lower quality.

    It contains the text of each article along with all the images from that article and metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are only a small subset of all available articles, because they are manually reviewed and protected from edits. Thus they represent the best quality human editors on Wikipedia can offer.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    
    label - description
    pageN - the title of the N-th Wikipedia page; the directory contains all information about that page
    text.json - text of the page saved as JSON. Please refer to the details of the JSON schema below.
    meta.json - a collection of all images of the page. Please refer to the details of the JSON schema below.
    imageN - the N-th image of an article, saved in jpg format where the width of each image is set to 600px. The name of the image is the md5 hashcode of the original image title.

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "html": "...",
     "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
    }

    key - description
    title - page title
    id - unique page id
    url - url of the page on Wikipedia
    html - HTML content of the article
    wikitext - wikitext content of the article

    Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
       "title": "IronbottomSound.jpg",
       "parsed_title": "ironbottom sound",
       "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
       "is_icon": false,
       "on_commons": true,
       "description": "A U.S. destroyer steams up what later became known as ...",
       "caption": "Ironbottom Sound. The majority of the warship surface ...",
       "headings": ["Naval Battle of Guadalcanal", "First Naval Battle of Guadalcanal", ...],
       "features": [4.8618264, 0.49436468, 7.0841103, 2.7377882, 2.1305492, ...]
      },
      ...
     ]
    }
    
    key - description
    filename - unique image id, the md5 hashcode of the original image title
    title - image title retrieved from Commons, if applicable
    parsed_title - image title split into words, i.e. "helloWorld.jpg" -> "hello world"
    url - url of the image on Wikipedia
    is_icon - true if the image is an icon, e.g. a category icon. We assume an image is an icon if you cannot load a preview on Wikipedia after clicking on it
    on_commons - true if the image is available from the Wikimedia Commons dataset
    description - description of the image parsed from its Wikimedia Commons page, if available
    caption - caption of the image parsed from the Wikipedia article, if available
    headings - list of all nested headings of the location where the image is placed in the Wikipedia article; the first element is the top-most heading
    features - output of the 5th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape of (2048,). Features are taken from the original images downloaded in jpeg format with a fixed width of 600px. Practically, it is a list of floats with len = 2048
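Given the page-per-directory layout described above (pageN/text.json plus pageN/img/meta.json), iterating the dataset reduces to walking directories and parsing two JSON files per page. The sketch below builds a one-page stand-in of that layout in a temporary directory so it runs as-is; with the real dataset, `root` would point at the download location.

```python
import json
import pathlib
import tempfile

# Create a tiny stand-in for the dataset layout; the sample page and
# its contents are illustrative, not real dataset records.
root = pathlib.Path(tempfile.mkdtemp())
page = root / "Naval Battle of Guadalcanal"
(page / "img").mkdir(parents=True)
(page / "text.json").write_text(json.dumps(
    {"title": "Naval Battle of Guadalcanal", "id": 405411}))
(page / "img" / "meta.json").write_text(json.dumps({"img_meta": []}))

# One directory per article: read the page text and its image metadata.
pages = []
for page_dir in sorted(root.iterdir()):
    text = json.loads((page_dir / "text.json").read_text())
    meta = json.loads((page_dir / "img" / "meta.json").read_text())
    pages.append((text["title"], len(meta["img_meta"])))

print(pages)
```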

    Collection method

    Data was collected by fetching featured articles' text and image content with the pywikibot library and then parsing additional metadata out of the HTML pages from Wikipedia and Commons.

  10. Multilingual Historical News Article Extraction and Classification Dataset

    • zenodo.org
    csv
    Updated Jan 12, 2025
    Cite
    Johanna Mauermann; Carlos-Emiliano González-Gallardo; Sarah Oberbichler (2025). Multilingual Historical News Article Extraction and Classification Dataset [Dataset]. http://doi.org/10.57967/hf/3965
    Explore at:
    csv
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johanna Mauermann; Carlos-Emiliano González-Gallardo; Sarah Oberbichler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 20, 2024
    Description

    This dataset was created specifically to test LLMs' capabilities in processing and extracting topic-specific articles from historical unstructured newspaper issues. While traditional article separation tasks rely on layout information or a combination of layout and semantic understanding, this dataset evaluates a novel approach using OCR'd text and context understanding. This method can considerably improve the corpus-building process for individual researchers working on specific topics such as migration or disasters. The dataset consists of French, German, and English newspapers from 1909 and contains multiple layers of information: detailed metadata about each newspaper issue (including identifiers, titles, dates, and institutional information), full-text content of newspaper pages or sections, context windows for processing, and human-annotated ground-truth extractions. The dataset is structured to enable a three-step evaluation of LLMs: first, their ability to classify content as relevant or not relevant to a specific topic (such as the 1908 Messina earthquake); second, their accuracy in extracting complete relevant articles from the broader newspaper text; and third, their ability to correctly mark the beginning and end of articles, especially when several articles were published in the same newspaper issue. By providing human-annotated ground truth, the dataset allows for systematic assessment of how well LLMs can understand historical text, maintain contextual relevance, and perform precise information extraction. This testing framework helps evaluate LLMs' effectiveness in handling real-world historical document processing tasks while maintaining accuracy and contextual understanding.

  11. AmericanStories

    • huggingface.co
    • opendatalab.com
    Updated Jun 14, 2023
    Cite
    Dell Research Harvard (2023). AmericanStories [Dataset]. http://doi.org/10.57967/hf/0757
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Dell Technologies (http://dell.com/)
    Authors
    Dell Research Harvard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    American Stories offers high-quality structured data from historical newspapers suitable for pre-training large language models to enhance the understanding of historical English and world knowledge. It can also be integrated into external databases of retrieval-augmented language models, enabling broader access to historical information, including interpretations of political events and intricate details about people's ancestors. Additionally, the structured article texts facilitate the application of transformer-based methods for popular tasks like detecting reproduced content, significantly improving accuracy compared to traditional OCR methods. American Stories serves as a substantial and valuable dataset for advancing multimodal layout analysis models and other multimodal applications.

  12. Characteristics, utilization and influence of viewpoint articles from the Structured Operational Research and Training Initiative (SORT IT) – 2009-2020

    • datadryad.org
    • zenodo.org
    zip
    Updated Nov 19, 2020
    Cite
    Katie Tayler-Smith (2020). Characteristics, utilization and influence of viewpoint articles from the Structured Operational Research and Training Initiative (SORT IT) – 2009-2020 [Dataset]. http://doi.org/10.5061/dryad.fj6q573sk
    Explore at:
    zip
    Dataset updated
    Nov 19, 2020
    Dataset provided by
    Dryad
    Authors
    Katie Tayler-Smith
    Time period covered
    Nov 18, 2020
    Description

    Background: The Structured Operational Research and Training Initiative (SORT IT) teaches the practical skills of conducting and publishing operational research (OR) to influence health policy and/or practice. In addition to original research articles, viewpoint articles are also produced and published as secondary outputs of SORT IT courses. We assessed the characteristics, use and influence of viewpoint articles derived from all SORT IT courses.

    Methods: This was a cross-sectional study involving all published viewpoint articles derived from the SORT IT courses held between August 2009 and March 2020. Characteristics of these papers were sourced from the papers themselves and from SORT-IT members involved in writing the papers. Data on use were sourced from the metrics provided on the online publishing platforms and from Google Scholar. Influence on policy and practice was self-assessed by the authors of the papers and was performed only for papers deemed to be ‘calls for action’.

    ...

  13. The structure of a primary research article.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maureen A. Carey; Kevin L. Steiner; William A. Petri Jr (2023). The structure of a primary research article. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008032.t002
    Explore at:
    xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maureen A. Carey; Kevin L. Steiner; William A. Petri Jr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The structure of a primary research article.

  14. Human Written Text

    • kaggle.com
    Updated May 13, 2025
    Cite
    Youssef Elebiary (2025). Human Written Text [Dataset]. https://www.kaggle.com/datasets/youssefelebiary/human-written-text
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Youssef Elebiary
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with a white space using automated scripts and regex.

    Data Source Distribution

    1. 10,000 Wikipedia Articles: From the 20220301 dump.
    2. 3,000 Gutenberg Books: Via the GutenDex API.
    3. 7,000 CNN/DailyMail News Articles: From the CNN/DailyMail 3.0.0 dataset.

    Why These Sources

    The data was collected from these sources to ensure the highest level of integrity against AI-generated text.
    * Wikipedia: The 20220301 dump was chosen to minimize the chance of including articles generated or heavily edited by AI.
    * Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
    * CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.

    Dataset Structure

    The dataset consists of 5 CSV files.
    1. CNN_DailyMail.csv: Contains all processed news articles.
    2. Gutenberg.csv: Contains all processed books.
    3. Wikipedia.csv: Contains all processed Wikipedia articles.
    4. Human.csv: Combines all three datasets in order.
    5. Shuffled_Human.csv: The randomly shuffled version of Human.csv.

    Each file has 2 columns:
    - Title: The title of the item.
    - Text: The content of the item.
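Since every file shares the same two columns, loading any of them takes only the standard csv module. The sketch below substitutes an inline two-column sample for the real Shuffled_Human.csv so it runs standalone; the sample row is invented for illustration.

```python
import csv
import io

# Inline stand-in for one of the dataset's CSV files; every file has
# exactly two columns, Title and Text.
sample_csv = (
    "Title,Text\n"
    "Example article,A human-written passage about an example topic.\n"
)

# Parse rows into dicts keyed by the header names.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
texts = [row["Text"] for row in rows]  # ready for tokenization or labeling

print(len(texts))
```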

    Uses

    This dataset is suitable for a wide range of NLP tasks, including:
    - Training models to distinguish between human-written and AI-generated text (Human/AI classifiers).
    - Training LSTMs or Transformers for chatbots, summarization, or topic modeling.
    - Sentiment analysis, genre classification, or linguistic research.

    Disclaimer

    Although the data was collected from these sources, it may not be 100% free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics. CNN/DailyMail articles may focus on specific news topics or regions.

    For details on how the dataset was created, see the Kaggle notebook linked from the dataset page.

    Licensing

    This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.

  15. Development of the number of articles with new contributions per period.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Iassen Halatchliyski; Ulrike Cress (2023). Development of the number of articles with new contributions per period. [Dataset]. http://doi.org/10.1371/journal.pone.0111958.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Iassen Halatchliyski; Ulrike Cress
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Development of the number of articles with new contributions per period.

  16. A Structured Human-Annotated Dataset for Food Extrusion Literature

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Feb 27, 2025
    Jordan Pennells; Mr Jordan Pennells; Mr Jordan Pennells (2025). A Structured Human-Annotated Dataset for Food Extrusion Literature [Dataset]. http://doi.org/10.25919/R4Y6-R260
    Explore at:
    datadownloadAvailable download formats
    Dataset updated
    Feb 27, 2025
    Dataset provided by
    CSIROhttp://www.csiro.au/
    Authors
    Jordan Pennells; Mr Jordan Pennells; Mr Jordan Pennells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 17, 2022 - Feb 26, 2025
    Description

    This dataset is a manually curated collection of structured data extracted from peer-reviewed food extrusion research articles. The dataset captures key parameters relevant to food extrusion processes, including equipment configurations, processing conditions, formulation details, and characterization methods. It is intended to support literature synthesis, meta-analyses, and knowledge representation in food extrusion research. This dataset provides a searchable, structured repository for researchers to efficiently access and analyse trends in food extrusion studies beyond what is typically available in standard academic databases. Lineage: This dataset was manually curated from 335 peer-reviewed food extrusion research articles sourced from the Web of Science database. The literature search used the following search syntax: "extru*" (Topic) AND "food" (Topic) NOT "packaging" (Topic). WoS Category filter: Food Science Technology, Nutrition & Dietetics, and Agriculture Dairy Animal Science. Key parameters, including equipment configurations, processing conditions, formulation details, and characterisation methods, were extracted, structured, and categorised by a domain expert in food engineering following a predefined schema. Citation screening was performed to ensure dataset quality.

  17. Dataset for the article: "Weak genetic structure despite strong genomic...

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Bekkevold, Dorte (2020). Dataset for the article: "Weak genetic structure despite strong genomic signal in lesser sandeel in the North Sea" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3458887
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Bekkevold, Dorte
    Le Moan, Alan
    Christensen, Asbjørn
    van Deurs, Mikael
    Hemmer-Hansen, Jakob
    Jiménez-Mena, Belén
    Mosegaard, Henrik
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Sea
    Description

    Dataset used for the article: "Weak genetic structure despite strong genomic signal in lesser sandeel in the North Sea".

    The dataset consists of a VCF file from 471 individuals of lesser sandeel, Ammodytes marinus (L.). This VCF is the end product of the bioinformatic analysis described in Jiménez-Mena et al. (2019). Data were obtained from double-digest Restriction-site Associated DNA (ddRAD) sequencing; more information is available in the Methods section of the article. The information for each individual in the VCF is included as a separate file, along with the supplementary tables of the article.
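    VCF is a tab-separated text format whose `#CHROM` header line lists one column per genotyped individual after nine fixed fields, so counting those columns recovers the number of individuals (471 in this dataset). A minimal sketch using only Python's standard library, on a hypothetical two-line VCF rather than the real file:

```python
# Hypothetical minimal VCF with three sample columns; the dataset's real
# VCF has 471 sample columns, one per sandeel individual.
vcf_text = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_001\tind_002\tind_003
chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1
"""

samples = []
for line in vcf_text.splitlines():
    if line.startswith("#CHROM"):
        # The 9 fixed fields are CHROM..FORMAT; everything after is a sample ID.
        samples = line.split("\t")[9:]

print(len(samples))  # 3
```

    On the published file, the same header scan would report 471 samples.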

  18. Characteristics, utilization and influence of viewpoint articles from the...

    • search.dataone.org
    Updated May 13, 2025
    + more versions
    Katie Tayler-Smith (2025). Characteristics, utilization and influence of viewpoint articles from the Structured Operational Research and Training Initiative (SORT IT) – 2009-2020 [Dataset]. http://doi.org/10.5061/dryad.fj6q573sk
    Explore at:
    Dataset updated
    May 13, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Katie Tayler-Smith
    Time period covered
    Jan 1, 2020
    Description

    Background: The Structured Operational Research and Training Initiative (SORT IT) teaches the practical skills of conducting and publishing operational research (OR) to influence health policy and/or practice. In addition to original research articles, viewpoint articles are also produced and published as secondary outputs of SORT IT courses. We assessed the characteristics, use and influence of viewpoint articles derived from all SORT IT courses.

    Methods: This was a cross-sectional study involving all published viewpoint articles derived from the SORT IT courses held between August 2009 and March 2020. Characteristics of these papers were sourced from the papers themselves and from SORT-IT members involved in writing the papers. Data on use were sourced from the metrics provided on the online publishing platforms and from Google Scholar. Influence on policy and practice was self-assessed by the authors of the papers and was performed only for papers deemed to be ‘calls for action’. ...

  19. Publication and dataset for article "Structure and transport properties of...

    • zenodo.org
    bin, pdf, txt, zip
    Updated Jan 30, 2025
    E. Edmund; T. Bi; Z.M. Geballe; K. Brugman; J.-F. Lin; S. Chariton; V. B. Prakapenka; Ján Minár; R. E. Cohen; A. F. Goncharov; E. Edmund; T. Bi; Z.M. Geballe; K. Brugman; J.-F. Lin; S. Chariton; V. B. Prakapenka; Ján Minár; R. E. Cohen; A. F. Goncharov (2025). Publication and dataset for article "Structure and transport properties of FeS at planetary core conditions" [Dataset]. http://doi.org/10.5281/zenodo.14602522
    Explore at:
    txt, zip, bin, pdfAvailable download formats
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    E. Edmund; T. Bi; Z.M. Geballe; K. Brugman; J.-F. Lin; S. Chariton; V. B. Prakapenka; Ján Minár; R. E. Cohen; A. F. Goncharov; E. Edmund; T. Bi; Z.M. Geballe; K. Brugman; J.-F. Lin; S. Chariton; V. B. Prakapenka; Ján Minár; R. E. Cohen; A. F. Goncharov
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Publication and dataset for article "Structure and transport properties of FeS at planetary core conditions" in Earth and Planetary Science Letters, Volume 646, 15 November 2024, 118959.

  20. scientific_papers

    • tensorflow.org
    • huggingface.co
    Updated Dec 23, 2022
    + more versions
    (2022). scientific_papers [Dataset]. https://www.tensorflow.org/datasets/catalog/scientific_papers
    Explore at:
    Dataset updated
    Dec 23, 2022
    Description

    The scientific_papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

    Both "arxiv" and "pubmed" have the following features:

    • article: the body of the document, paragraphs separated by "\n".
    • abstract: the abstract of the document, paragraphs separated by "\n".
    • section_names: titles of sections, separated by "\n".

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('scientific_papers', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.
