97 datasets found
  1. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available in a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a Tesseract script to run text extraction on detected text rows. It is included in our code archive (code.tar) as text_recognition_multipro.py.

    We used a Java program provided by Falk Böschen, adapted to our file structure. We included it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  2. Amharic text dataset extracted from memes for hate speech detection or...

    • data.mendeley.com
    Updated Jun 8, 2023
    + more versions
    Cite
    Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
    Explore at:
    Dataset updated
    Jun 8, 2023
    Authors
    Mequanent Degu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was collected from social media such as Facebook and Telegram and then further processed. It comes in three variants: original_cleaned (neither stemmed nor stopword-removed), stopword_removed (stopwords removed but not stemmed), and stemmed (both stemmed and stopword-removed). Stemming was done using HornMorpho, developed by Michael Gasser (available at https://github.com/hltdi/HornMorpho). All variants are normalized and free of noise such as punctuation marks and emojis.
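As a generic illustration of how the stopword_removed variant can be produced from the cleaned text: the sketch below uses an invented English placeholder stopword list, not a real Amharic one, but the same pipeline applies to Amharic text since Python's \w matches Amharic script.

```python
import re

# Invented placeholder list for illustration; a real Amharic stopword
# list would be substituted here.
STOPWORDS = {"the", "a", "is", "this"}

def clean(text):
    # normalisation pass: strip punctuation and emoji-like symbols
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = clean("This is a meme!")
print(remove_stopwords(tokens))  # ['meme']
```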

  3. Softcite Dataset: A dataset of software mentions in research publications

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 17, 2021
    Cite
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon (2021). Softcite Dataset: A dataset of software mentions in research publications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4444074
    Explore at:
    Dataset updated
    Jan 17, 2021
    Dataset provided by
    The University of Texas at Austin
    SCIENCE-MINER
    Authors
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon
    Description

    The Softcite dataset is a gold-standard dataset of software mentions in research publications, a free resource primarily for software entity recognition in scholarly text. This is the first release of this dataset.

    What's in the dataset

    With the aim of facilitating software entity recognition at scale, and eventually increasing the visibility of research software so that software contributions receive due credit, a team of trained annotators from the Howison Lab at the University of Texas at Austin annotated 4,093 software mentions in 4,971 open access research publications in biomedicine (from the PubMed Central Open Access collection) and economics (from Unpaywall open access services). The annotated software mentions, along with their publisher, version, and access URL where mentioned in the text, as well as the publications annotated as containing no software mentions, are all included in the released dataset as a TEI/XML corpus file.

    For understanding the schema of the Softcite corpus, its design considerations, and provenance, please refer to our paper included in this release (preprint version).
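For readers who want to process the corpus file programmatically, here is a minimal sketch using Python's standard library on a toy TEI-like fragment. The <rs type="software"> element name is an assumption for illustration only; check the actual schema against the included paper before relying on it.

```python
import xml.etree.ElementTree as ET

# Toy TEI-like fragment; the real file is softcite_corpus-full.tei.xml and
# its element names/namespaces should be verified against the schema.
sample = """
<TEI>
  <text><body>
    <p>We analysed the data with <rs type="software">SPSS</rs>,
       version <rs type="version">25</rs>.</p>
    <p>No software was mentioned here.</p>
  </body></text>
</TEI>
"""

root = ET.fromstring(sample)
mentions = [rs.text for rs in root.iter("rs") if rs.get("type") == "software"]
print(mentions)  # ['SPSS']
```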

    Use scenarios

    The release of the Softcite dataset is intended to encourage researchers and stakeholders to make research software more visible in science, especially to academic databases and systems of information retrieval; and facilitate interoperability and collaboration among similar and relevant efforts in software entity recognition and building utilities for software information retrieval. This dataset can also be useful for researchers investigating software use in academic research.

    Current release content

    softcite-dataset v1.0 release includes:

    The Softcite dataset corpus file: softcite_corpus-full.tei.xml

    Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications, our paper that describes the design consideration and creation process of the dataset: Softcite_Dataset_Description_RC.pdf. (This is a preprint version of our forthcoming publication in the Journal of the Association for Information Science and Technology.)

    The Softcite dataset is licensed under a Creative Commons Attribution 4.0 International License.

    If you have questions, please start a discussion or issue in the howisonlab/softcite-dataset GitHub repository.

  4. Examining the Capacity of Text Mining and Software Metrics in Vulnerability...

    • data.europa.eu
    unknown
    Updated Sep 21, 2023
    Cite
    Zenodo (2023). Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction [dataset] [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8369963?locale=fr
    Explore at:
    unknown (79359120; available download formats)
    Dataset updated
    Sep 21, 2023
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is an extension of a publicly available dataset originally published by Ferenc et al. in their paper: “Ferenc, R.; Hegedus, P.; Gyimesi, P.; Antal, G.; Bán, D.; Gyimóthy, T. Challenging machine learning algorithms in predicting vulnerable javascript functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). IEEE, 2019, pp. 8–14.” That dataset contained software metrics for source code functions written in the JavaScript (JS) programming language, with each function labeled as vulnerable or clean; the authors gathered vulnerabilities from publicly available vulnerability databases.

    In our paper “Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction”, cited as “Kalouptsoglou I, Siavvas M, Kehagias D, Chatzigeorgiou A, Ampatzoglou A. Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction. Entropy. 2022; 24(5):651. https://doi.org/10.3390/e24050651”, we presented an extended version of the dataset by extracting textual features for the labeled JS functions. In particular, we took the dataset provided by Ferenc et al. in CSV format and gathered the GitHub URLs of all of the dataset's functions (i.e., methods). Using these URLs, we collected the source code of the corresponding JS files from GitHub. Then, using the start and end line information for every function, we cut out the code of each function. Each function was tokenized to construct a list of tokens per function. To extract text features, we used a text mining technique based on sequences of tokens. As a result, we created a repository with each method's source code, token sequence, and label. To improve the generalizability of type-specific tokens, all comments were eliminated, and all integers and strings were replaced with two unique IDs.
    The dataset contains 12,106 JavaScript functions, of which 1,493 are considered vulnerable. It was created and used during the Vulnerability Prediction Task of the Horizon 2020 IoTAC project as training and evaluation data for vulnerability prediction models. The dataset is provided in CSV format. Each row of the CSV file has the following parts:

    - Label: flag with value ‘1’ for vulnerable and ‘0’ for non-vulnerable methods
    - Name: the name of the JavaScript method
    - Longname: the long name of the JavaScript method
    - Path: the path of the method's file in the repository
    - Full_repo_path: the GitHub URL of the method's file
    - TokenX: each subsequent column holds one token of the method
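A sketch of how rows with this layout can be consumed; the two sample rows below are invented, only the column order (label, name, longname, path, full_repo_path, then the token sequence) follows the description above.

```python
import csv
import io

# Invented two-row sample mirroring the described layout.
sample = io.StringIO(
    "1,parse,app.parse,src/app.js,https://github.com/org/repo/blob/main/src/app.js,function,parse,(,STR,)\n"
    "0,init,app.init,src/app.js,https://github.com/org/repo/blob/main/src/app.js,function,init,(,)\n"
)

labels, token_seqs = [], []
for row in csv.reader(sample):
    labels.append(int(row[0]))   # column 0 is the vulnerability label
    token_seqs.append(row[5:])   # remaining columns are the token sequence

print(labels)            # [1, 0]
print(token_seqs[0][:3]) # ['function', 'parse', '(']
```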

  5. GitHub Public Pull Request Comments

    • kaggle.com
    zip
    Updated Sep 6, 2023
    Cite
    Peter (2023). GitHub Public Pull Request Comments [Dataset]. https://www.kaggle.com/datasets/pelmers/github-public-pull-request-comments/code
    Explore at:
    zip (3307386118 bytes; available download formats)
    Dataset updated
    Sep 6, 2023
    Authors
    Peter
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used for the master's thesis "LLMs for Code Comment Consistency." Covers the languages Go, Java, JavaScript, TypeScript, and Python. All data is mined from permissively-licensed GitHub public projects with at least 25 stars and 25 pull requests submitted at the time of access, based on the GitHub Public Repository Metadata Dataset.

    This dataset pertains specifically to pull request comments that are made on files. In other words, every comment in this dataset is linked to a specific file in a pull request.

    What can I do with this data?

    Anything you want, of course, but here are some starter ideas:

    - Sentiment analysis of comments: is there a correlation between number of contributions and positivity of reviews?
    - Pull request comment generation: can we automatically make code review comments?
    - PR text mining: can we mine out examples of a specific type of comment? (In my project, this was comments about function documentation.)

    The mining code is publicly accessible at https://github.com/pelmers/llms-for-code-comment-consistency/tree/main/rq3

    Each file is a JSON object where each key is a GitHub repository, and each value is a list of pull request comments in that repository. Example:

    {
      "trekhleb/javascript-algorithms": [{
        "html_url": "https://github.com/trekhleb/javascript-algorithms/pull/101#discussion_r204437121",
        "path": "src/algorithms/string/knuth-morris-pratt/knuthMorrisPratt.js",
        "line": 33,
        "body": "Please take a look at the comments to the tests above. No need to do this checking.",
        "user": "trekhleb",
        "diff_hunk": "@@ -30,6 +30,10 @@ function buildPatternTable(word) { * @return {number} */ export default function knuthMorrisPratt(text, word) { + if (word.length === 0) {",
        "author_association": "OWNER",
        "commit_id": "618d0962025ff1116979560a0bfa0ed1660f129e",
        "id": 204437121,
        "repo": "trekhleb/javascript-algorithms"
      }, ...]
    }
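A minimal sketch of loading one file of the dataset with Python's standard library, using a trimmed version of the example record above (only a subset of the fields is shown):

```python
import json

# Trimmed copy of the example record; the real files hold many
# repositories, each mapping to a list of comment objects.
raw = """{
  "trekhleb/javascript-algorithms": [{
    "path": "src/algorithms/string/knuth-morris-pratt/knuthMorrisPratt.js",
    "line": 33,
    "body": "Please take a look at the comments to the tests above.",
    "user": "trekhleb"
  }]
}"""

data = json.loads(raw)
for repo, comments in data.items():
    for c in comments:
        print(repo, c["path"], c["line"])
```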

  6. Friends - R Package Dataset

    • kaggle.com
    zip
    Updated Nov 11, 2024
    Cite
    Lucas Yukio Imafuko (2024). Friends - R Package Dataset [Dataset]. https://www.kaggle.com/datasets/lucasyukioimafuko/friends-r-package-dataset
    Explore at:
    zip (2018791 bytes; available download formats)
    Dataset updated
    Nov 11, 2024
    Authors
    Lucas Yukio Imafuko
    Description

    The whole data and source can be found at https://emilhvitfeldt.github.io/friends/

    "The goal of friends to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."

    Content

    • friends.csv - Contains the scenes and lines for each character, including season and episodes.
    • friends_emotions.csv - Contains sentiments for each scene - for the first four seasons only.
    • friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.

    Uses

    • Text mining, sentiment analysis and word statistics.
    • Data visualizations.
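As a starting point for the word-statistics use case, here is a sketch that counts lines per character from rows shaped like friends.csv; the column names in this toy sample are a guess from the description above (the authoritative schema is at the source link).

```python
import csv
import io

# Invented rows; assumed columns: season, episode, scene, speaker, text.
sample = io.StringIO(
    "season,episode,scene,speaker,text\n"
    "1,1,1,Monica,There is nothing to tell\n"
    "1,1,1,Joey,Come on\n"
    "1,1,2,Monica,Okay\n"
)
counts = {}
for row in csv.DictReader(sample):
    counts[row["speaker"]] = counts.get(row["speaker"], 0) + 1
print(counts)  # {'Monica': 2, 'Joey': 1}
```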
  7. Austin_Survey_for_MDCOR_Analyses

    • data.mendeley.com
    Updated Nov 14, 2022
    Cite
    Manuel Gonzalez Canche (2022). Austin_Survey_for_MDCOR_Analyses [Dataset]. http://doi.org/10.17632/nb7yvhjvzk.1
    Explore at:
    Dataset updated
    Nov 14, 2022
    Authors
    Manuel Gonzalez Canche
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Austin
    Description

    The city of Austin administered a community survey in 2015, 2016, 2017, 2018, and 2019 (https://data.austintexas.gov/City-Government/Community-Survey/s2py-ceb7) to “assess satisfaction with the delivery of the major City Services and to help determine priorities for the community as part of the City’s ongoing planning process.” To access this dataset directly from the city of Austin’s website, you can follow this link: https://cutt.ly/VNqq5Kd. Although we downloaded the dataset analyzed in this study from the former link, given that the city of Austin intends to continue administering this survey, the data we used for this analysis and the data hosted on the city of Austin’s website may diverge in the following years. Accordingly, to ensure the replication of our findings, we recommend that researchers download and analyze the dataset we employed in our analyses, which can be accessed at the following link: https://github.com/democratizing-data-science/MDCOR/blob/main/Community_Survey.csv.

    Replication Features or Variables

    The community survey data has 10,684 rows and 251 columns. Of these columns, our analyses rely on the following three indicators, taken verbatim from the survey: “ID”; “Q25 - If there was one thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?”; and “Do you own or rent your home?”
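A sketch of pulling just the three analysis columns out of a wide survey file; the shortened headers and the toy rows below are hypothetical stand-ins for the verbatim survey questions quoted above.

```python
import csv
import io

# Toy rows standing in for the 10,684 x 251 survey file; only the three
# columns used in the analysis are kept (values are invented).
sample = io.StringIO(
    "ID,Q25,Own_or_rent,Other\n"
    "1,Fix the roads,Own,x\n"
    "2,More parks please,Rent,y\n"
)
keep = {"ID", "Q25", "Own_or_rent"}
rows = [{k: v for k, v in r.items() if k in keep}
        for r in csv.DictReader(sample)]
print(rows[0])  # {'ID': '1', 'Q25': 'Fix the roads', 'Own_or_rent': 'Own'}
```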

  8. GitHub Public Pull Request Comments

    • zenodo.org
    zip
    Updated Nov 16, 2023
    Cite
    Anonymous; Anonymous (2023). GitHub Public Pull Request Comments [Dataset]. http://doi.org/10.5281/zenodo.10138317
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over 13 MILLION pull request comments

    Dataset used for the master's thesis "LLMs for Code Comment Consistency." Covers the languages Go, Java, JavaScript, TypeScript, and Python. All data is mined from permissively-licensed GitHub public projects with at least 25 stars and 25 pull requests submitted at the time of access.

    This dataset pertains specifically to **pull request comments that are made on files.** In other words, every comment in this dataset is linked to a specific file in a pull request.

    ### What can I do with this data?

    Anything you want, of course, but here are some starter ideas:

    - Sentiment analysis of comments, is there a correlation between number of contributions and positivity of reviews?

    - Pull request comment generation: can we automatically make code review comments?

    - PR text mining: can we mine out examples of a specific type of comment? (in my project, this was comments about function documentation)

    The mining code is publicly accessible.

    Each file is a JSON object where each key is a GitHub repository, and each value is a list of pull request comments in that repository.

  9. Steam Dataset 2025: Multi-Modal Gaming Analytics

    • kaggle.com
    zip
    Updated Oct 7, 2025
    Cite
    CrainBramp (2025). Steam Dataset 2025: Multi-Modal Gaming Analytics [Dataset]. https://www.kaggle.com/datasets/crainbramp/steam-dataset-2025-multi-modal-gaming-analytics
    Explore at:
    zip (12478964226 bytes; available download formats)
    Dataset updated
    Oct 7, 2025
    Authors
    CrainBramp
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Steam Dataset 2025: Multi-Modal Gaming Analytics Platform

    The first multi-modal Steam dataset with semantic search capabilities. 239,664 applications collected from official Steam Web APIs with PostgreSQL database architecture, vector embeddings for content discovery, and comprehensive review analytics.

    Made by a lifelong gamer for the gamer in all of us. Enjoy!🎮

    GitHub Repository https://github.com/vintagedon/steam-dataset-2025

    [Figure: 1024-dimensional game embeddings projected to 2D via UMAP reveal natural genre clustering in semantic space]

    What Makes This Different

    Unlike traditional flat-file Steam datasets, this is built as an analytically-native database optimized for advanced data science workflows:

    ☑️ Semantic Search Ready - 1024-dimensional BGE-M3 embeddings enable content-based game discovery beyond keyword matching

    ☑️ Multi-Modal Architecture - PostgreSQL + JSONB + pgvector in unified database structure

    ☑️ Production Scale - 239K applications vs typical 6K-27K in existing datasets

    ☑️ Complete Review Corpus - 1,048,148 user reviews with sentiment and metadata

    ☑️ 28-Year Coverage - Platform evolution from 1997-2025

    ☑️ Publisher Networks - Developer and publisher relationship data for graph analysis

    ☑️ Complete Methodology & Infrastructure - Full work logs document every technical decision and challenge encountered, while my API collection scripts, database schemas, and processing pipelines enable you to update the dataset, fork it for customized analysis, learn from real-world data engineering workflows, or critique and improve the methodology

    [Figure: Market segmentation and pricing strategy analysis across top 10 genres]

    What's Included

    Core Data (CSV Exports):

    - 239,664 Steam applications with complete metadata
    - 1,048,148 user reviews with scores and statistics
    - 13 normalized relational tables for pandas/SQL workflows
    - Genre classifications, pricing history, platform support
    - Hardware requirements (min/recommended specs)
    - Developer and publisher portfolios

    Advanced Features (PostgreSQL):

    - Full database dump with optimized indexes
    - JSONB storage preserving complete API responses
    - Materialized columns for sub-second query performance
    - Vector embeddings table (pgvector-ready)

    Documentation:

    - Complete data dictionary with field specifications
    - Database schema documentation
    - Collection methodology and validation reports
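To make the semantic-search idea concrete: in the PostgreSQL dump this would be a pgvector nearest-neighbour query over the embeddings table, but the core operation reduces to similarity ranking over vectors. A minimal sketch with invented 3-dimensional vectors standing in for the 1024-dimensional BGE-M3 embeddings:

```python
import math

# Toy 3-d vectors standing in for the real 1024-d game embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

games = {
    "farm-sim": [0.9, 0.1, 0.0],
    "space-shooter": [0.1, 0.9, 0.2],
    "farm-tycoon": [0.8, 0.2, 0.1],
}
query = games["farm-sim"]
ranked = sorted((g for g in games if g != "farm-sim"),
                key=lambda g: cosine(query, games[g]), reverse=True)
print(ranked[0])  # farm-tycoon
```

In the shipped database, the same ranking can be expressed as a single ORDER BY over a pgvector distance operator (see the pgvector documentation for operator names) instead of computing similarities in Python.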

    Example Analysis: Published Notebooks (v1.0)

    Three comprehensive analysis notebooks demonstrate dataset capabilities. All notebooks render directly on GitHub with full visualizations and output:

    📊 Platform Evolution & Market Landscape

    View on GitHub | PDF Export
    28 years of Steam's growth, genre evolution, and pricing strategies.

    🔍 Semantic Game Discovery

    View on GitHub | PDF Export
    Content-based recommendations using vector embeddings across genre boundaries.

    🎯 The Semantic Fingerprint

    View on GitHub | PDF Export
    Genre prediction from game descriptions - demonstrates text analysis capabilities.

    Notebooks render with full output on GitHub. Kaggle-native versions planned for v1.1 release. CSV data exports included in dataset for immediate analysis.

    [Figure: Steam platfor...]

  10. U

    Replication Data for: A Review of Best Practice Recommendations for...

    • dataverse-staging.rdmc.unc.edu
    • datasearch.gesis.org
    Updated Nov 7, 2017
    Cite
    Ryan Wesslen; Ryan Wesslen (2017). Replication Data for: A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App) [Dataset]. http://doi.org/10.15139/S3/R4W7ZS
    Explore at:
    csv(1070619), application/x-rlang-transport(1014184), pdf(76215), text/x-r-markdown(14242), text/x-r-markdown(12162), html(2930583), application/x-rlang-transport(2108553), docx(24677), html(2442743), html(1689406), text/markdown(1958), application/x-rlang-transport(1623238), text/x-r-markdown(12252) (available download formats)
    Dataset updated
    Nov 7, 2017
    Dataset provided by
    UNC Dataverse
    Authors
    Ryan Wesslen; Ryan Wesslen
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". These materials are also available in the GitHub repo (https://github.com/wesslen/text-analysis-org-science), and the accompanying Shiny app is in its own GitHub repo (https://github.com/wesslen/topicApp).

  11. Nanopubs extracted from DisGeNET v3.0.0.0

    • data.wu.ac.at
    rdf, trig gzip
    Updated Nov 24, 2015
    + more versions
    Cite
    Nanopublications (2015). Nanopubs extracted from DisGeNET v3.0.0.0 [Dataset]. https://data.wu.ac.at/odso/datahub_io/ZmM5NDU5Y2MtYTVhMy00MTc4LTg5ZDYtMWZjNWE5MDFmODI4
    Explore at:
    rdf, trig gzip (available download formats)
    Dataset updated
    Nov 24, 2015
    Dataset provided by
    Nanopublications
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    1,018,735 nanopublications. These nanopubs were automatically extracted from the DisGeNET dataset. See also the main DisGeNET data on Datahub at https://datahub.io/dataset/disgenet.

    Download the content of this set of nanopublications from the server network using nanopub-java at https://github.com/Nanopublication/nanopub-java:

    $ np get -c -o nanopubs.trig RAVEKRW0m6Ly_PjmhcxCZMR5fYIlzzqjOWt1CgcwD_77c

  12. Ontology Enrichment from Texts (OET): A Biomedical Dataset for Concept...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Dec 26, 2023
    + more versions
    Cite
    Dong, Hang; Chen, Jiaoyan; He, Yuan; Horrocks, Ian (2023). Ontology Enrichment from Texts (OET): A Biomedical Dataset for Concept Discovery and Placement [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8043689
    Explore at:
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    University of Manchester
    University of Oxford
    Authors
    Dong, Hang; Chen, Jiaoyan; He, Yuan; Horrocks, Ian
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A biomedical dataset supporting ontology enrichment from texts, by concept discovery and placement, adapting the MedMentions dataset (PubMed abstracts) with SNOMED CT of versions in 2014 and 2017 under the Diseases (disorder) sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic (CPP) product.

    The dataset is documented in the work, Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement, on arXiv: https://arxiv.org/abs/2306.14704 (CIKM 2023). The companion code is available at https://github.com/KRR-Oxford/OET.

    Out-of-KB mention discovery (including the settings of mention-level data) is further partly documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).

    ver4: we made a version of mention-level data for out-of-KB discovery and concept placement separately: the former (for out-of-KB discovery) has out-of-KB mentions in training data, while the latter (for concept placement) has only out-of-KB mentions during the evaluation (validation and test) and not in the training data. Also, we split the original "test-NIL.jsonl" (now "test-NIL-all.jsonl") into "valid-NIL.jsonl" and "test-NIL.jsonl" for a better evaluation.

    ver3: we revised and updated mention-level data (syn_full, synonym augmentation setting) and the folder structure, and also updated the edge catalogues with complex edges.

    ver2: we revised the mention-level data by keeping only out-of-KB mentions (or "NIL" mentions) associated with one-hop edges (including leaf nodes) and two-hop edges in the ontology (SNOMED CT 20140901).
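The mention-level files named above (e.g. test-NIL.jsonl) use the JSON Lines format, one record per line. A minimal reading sketch follows; the field names in the toy records are invented for illustration, only the one-record-per-line format and the "NIL" notion are taken from the description.

```python
import io
import json

# Invented two-record sample in .jsonl shape; real field names should be
# checked against the dataset documentation.
sample = io.StringIO(
    '{"mention": "heart attack", "label": "NIL"}\n'
    '{"mention": "asthma", "label": "195967001"}\n'
)
records = [json.loads(line) for line in sample]
nil = [r for r in records if r["label"] == "NIL"]
print(len(records), len(nil))  # 2 1
```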

    Acknowledgement of data sources and tools below:

  13. Dataset for "Do LiU researchers publish data – and where? Dataset analysis...

    • researchdata.se
    • demo.researchdata.se
    • +1more
    Updated Mar 19, 2025
    Cite
    Kaori Hoshi Larsson (2025). Dataset for "Do LiU researchers publish data – and where? Dataset analysis using ODDPub" [Dataset]. http://doi.org/10.5281/zenodo.15017715
    Explore at:
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    Linköping University
    Authors
    Kaori Hoshi Larsson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the results from the ODDPub text mining algorithm and the findings from manual analysis. Full-text PDFs of all articles parallel-published by Linköping University in 2022 were extracted from the university's repository, DiVA. These were analyzed using the ODDPub (https://github.com/quest-bih/oddpub) text mining algorithm to determine the extent of data sharing and to identify the repositories where the data was shared. In addition to the ODDPub results, manual analysis was conducted to confirm the presence of data sharing statements, assess data availability, and identify the repositories used.
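ODDPub itself is an R package; the Python sketch below only illustrates the keyword-based idea behind detecting data-sharing statements in full text. The patterns here are invented examples, not ODDPub's actual keyword lists.

```python
import re

# Invented illustrative patterns; ODDPub's real keyword lists are in the
# quest-bih/oddpub repository.
PATTERNS = [
    r"data (are|is) available (at|from|in)",
    r"deposited (in|at) (zenodo|figshare|dryad)",
]

def mentions_data_sharing(text):
    text = text.lower()
    return any(re.search(p, text) for p in PATTERNS)

print(mentions_data_sharing("The data are available at Zenodo."))  # True
print(mentions_data_sharing("No statement here."))                 # False
```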

  14. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    docx (available download formats)
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, and instructions for using the code, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from the Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting abstracts and saving metadata: Metadata (all fields in a document excluding the abstract) and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text pre-processing steps on the collection of abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. Uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character "-". Examples of such words are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Identification of such words was done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining "-" characters are replaced by a space.
    6. Removing numbers: All digits that are not included in a word are replaced by a space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulae might be important for our analysis. Examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language, such as 'I', 'the' and 'a' in English. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:

    Word: Unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents containing the word, in descending order.

    Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.

    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:

    Metadata File: All fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.

    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.

    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used as follows:

    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
    2. Open the LScD_Creation.R script
    3. Change the parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
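    As an illustration of the counting scheme described above, the sketch below approximates pre-processing steps 1, 2, 5 and 6 and the two word counts in plain Python. The actual pipeline is the R code in [2]; prefix uniting, substitution, stemming and stop-word removal are omitted here.

```python
import re
from collections import Counter

def preprocess(abstract):
    """Rough approximation of pre-processing steps 1, 2, 5 and 6."""
    text = re.sub(r"[^0-9a-zA-Z-]", " ", abstract)  # step 1: non-alphanumerics -> space, keep "-"
    text = text.lower()                             # step 2: lowercase
    text = text.replace("-", " ")                   # step 5: drop remaining "-" (prefix uniting omitted)
    return [t for t in text.split() if not t.isdigit()]  # step 6: drop standalone numbers, keep "co2"

def dictionary_counts(abstracts):
    """Binary document frequency and total corpus frequency, as in LScD.csv."""
    doc_freq, corpus_freq = Counter(), Counter()
    for abstract in abstracts:
        tokens = preprocess(abstract)
        doc_freq.update(set(tokens))  # a word counts once per document, however often it occurs
        corpus_freq.update(tokens)    # every occurrence counts
    return doc_freq, corpus_freq

docs = ["The z-score of CO2 was 3.1", "A corpus, a Corpus, a CORPUS of abstracts"]
df, cf = dictionary_counts(docs)
```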

  15. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
    
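    A minimal conformance check against an ontology's relation vocabulary can be sketched as follows. The relation set below is illustrative, drawn from the music example above; it is not the benchmark's actual evaluation code.

```python
# Illustrative relation vocabulary for a hypothetical slice of the music ontology.
ALLOWED_RELATIONS = {"publication date", "lyrics by", "composer", "performer"}

def split_by_conformance(triples, allowed=ALLOWED_RELATIONS):
    """Separate predicted triples into those using known relations and violations."""
    valid = [t for t in triples if t["rel"] in allowed]
    invalid = [t for t in triples if t["rel"] not in allowed]
    return valid, invalid

predicted = [
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "sung in", "obj": "English"},  # not in the vocabulary
]
valid, invalid = split_by_conformance(predicted)
```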

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repository is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages

  16. NY Times Vector Corpus

    • kaggle.com
    zip
    Updated Nov 7, 2023
    Cite
    TheItCrow (2023). NY Times Vector Corpus [Dataset]. https://www.kaggle.com/datasets/kevinbnisch/ny-times-vectordb-for-topic-extraction
    Explore at:
    zip(3676887181 bytes)Available download formats
    Dataset updated
    Nov 7, 2023
    Authors
    TheItCrow
    Description

    About

    This is the first version of the English dataset for VecTop, which contains more than 250k articles (2018-10-01 to 2023-10-23) from the NY Times, embedded with OpenAI's text-embedding-ada-002. This corpus is used within VecTop to extract the topics and subtopics of a given text. Please refer to the GitHub page for more information, and to the live demo for a quick evaluation.

    This dataset is also supplied as a PostgreSQL backup. It is advisable to import the dataset into a proper database with vector functionality for instant results. See the GitHub repo for that.

    German Version

    A German version with Spiegel Online has already been released here.

    Use Cases

    Topic Extraction

    Given a small or large chunk of text, it is useful to categorize the text into topics. VecTop uses this dataset within a PostgreSQL database to first summarize the unlabeled text (if it is determined to be too long) and then create embeddings of it. These embeddings are compared to the dataset: VecTop determines the topics and subtopics by looking at the topics and subtopics of the closest embeddings under cosine similarity. As a result, the text is categorized into topics and subtopics.
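    The nearest-neighbour comparison can be sketched in a few lines. This is a toy illustration with 2-dimensional vectors; VecTop itself compares 1536-dimensional ada-002 embeddings inside PostgreSQL.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_topic(query_vec, corpus):
    """corpus: list of (embedding, topic) pairs; return the topic of the closest embedding."""
    return max(corpus, key=lambda item: cosine(query_vec, item[0]))[1]

corpus = [([1.0, 0.0], "politics"), ([0.0, 1.0], "sports")]
```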

    Searching

    The dataset can be used to search for similarities in texts.

    Legal Research

    Legal VecTop will be used to research legal activities. For that, a legal corpus is being built. (Coming soon)

    License

    VecTop, and therefore this dataset, is licensed under the Apache-2.0 license.

  17. BenchmarkDP - Text extraction from general documents benchmark dataset

    • figshare.com
    zip
    Updated Jun 21, 2017
    Cite
    Kresimir Duretec; Andreas Rauber; Christoph Becker (2017). BenchmarkDP - Text extraction from general documents benchmark dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4621003.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 21, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kresimir Duretec; Andreas Rauber; Christoph Becker
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains the benchmark data used for benchmarking text extraction tools. The data contains:

    - a list of documents
    - ground truth data for each document
    - additional metadata extracted with the FITS tool

    The data also contains benchmark results for the following tools:

    - Apache Tika v1.1
    - Apache Tika v1.2
    - Apache Tika v1.13
    - DocToText
    - XPdf

    The source code of the tools used to produce the dataset and the benchmark results can be found here:

    https://github.com/kduretec/DataGeneratorAnalysis - R scripts for producing the final results

    https://github.com/kduretec/TestDataGenerator - data and ground truth generator

    https://github.com/kduretec/ToolEvaluator - the part that evaluates the software components and produces results for the R scripts

  18. POLIcy design ANNotAtions (POLIANNA): Towards understanding policy design...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Dec 14, 2023
    Cite
    Sebastian Sewerin; Lynn H. Kaack; Joel Küttel; Fride Sigurdsson; Onerva Martikainen; Alisha Esshaki; Fabian Hafner (2023). POLIcy design ANNotAtions (POLIANNA): Towards understanding policy design through text-as-data approaches [Dataset]. http://doi.org/10.5281/zenodo.8284380
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Sewerin; Lynn H. Kaack; Joel Küttel; Fride Sigurdsson; Onerva Martikainen; Alisha Esshaki; Fabian Hafner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The POLIANNA dataset is a collection of legislative texts from the European Union (EU) that have been annotated based on theoretical concepts of policy design. The dataset consists of 20,577 annotated spans in 412 articles, drawn from 18 EU climate change mitigation and renewable energy laws, and can be used to develop supervised machine learning approaches for scaling policy analysis. The dataset includes a novel coding scheme for annotating text spans; a description of the annotated corpus, an analysis of inter-annotator agreement, and a discussion of potential applications can be found in the paper accompanying this dataset. The objective of this dataset is to build tools that assist with manual coding of policy texts by automatically identifying relevant paragraphs.

    Detailed instructions and further guidance about the dataset as well as all the code used for this project can be found in the accompanying paper and on the GitHub project page. The repository also contains useful code to calculate various inter-annotator agreement measures and can be used to process text annotations generated by INCEpTION.

    Dataset Description

    We provide the dataset in 3 different formats:

    JSON: Each article corresponds to a folder, where the Tokens and Spans are stored in separate JSON files. Each article folder further contains the raw policy text in a text file and the metadata about the policy. This is the most human-readable format.

    JSONL: Same folder structure as the JSON format, but the Spans and Tokens are stored in a JSONL file, where each line is a valid JSON document.

    Pickle: We provide the dataset as a Python object. This is the recommended method when using our own Python framework that is provided on GitHub. For more information, check out the GitHub project page.
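    A reader for the JSONL layout can be sketched as follows; the field names in the sample lines are invented for illustration and are not the actual schema (see the GitHub project page for that).

```python
import json

def read_jsonl(lines):
    """Parse JSONL content: one JSON document per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical span records, for illustration only.
sample = [
    '{"span_id": 1, "text": "renewable energy", "label": "Instrument"}',
    '{"span_id": 2, "text": "by 2030", "label": "Compliance_deadline"}',
]
spans = read_jsonl(sample)
```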


    License

    The POLIANNA dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. If you use the POLIANNA dataset in your research in any form, please cite the dataset.

    Citation

    Sewerin, S., Kaack, L.H., Küttel, J. et al. Towards understanding policy design through text-as-data approaches: The policy design annotations (POLIANNA) dataset. Sci Data10, 896 (2023). https://doi.org/10.1038/s41597-023-02801-z

  19. Data from: TRANSMAT Gold Standard

    • dataverse.cirad.fr
    application/x-gzip +2
    Updated Oct 25, 2023
    + more versions
    Cite
    Martin Lentschat; Patrice Buche; Luc Menut (2023). TRANSMAT Gold Standard [Dataset]. http://doi.org/10.18167/DVN1/U7HK8J
    Explore at:
    tsv(24301), tsv(382623), tsv(402471), pdf(297359), tsv(31878), application/x-gzip(4980)Available download formats
    Dataset updated
    Oct 25, 2023
    Authors
    Martin Lentschat; Patrice Buche; Luc Menut
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset presents a Gold Standard of data annotated on documents from the Science Direct website. The annotated entities are those related to permeability n-ary relations, as defined in the TRANSMAT Ontology (https://ico.iate.inra.fr/atWeb/, https://doi.org/10.15454/NK24ID, http://agroportal.lirmm.fr/ontologies/TRANSMAT) and following the annotation guide, also available here. The annotations were performed by three annotators on a WebAnno (doi: 10.3115/v1/P14-5016) server. The four files (one per annotator, plus a merged version with priority given to annotator 1 in case of conflicts on annotated items) were obtained from the output files of the WebAnno tool. They are presented in table format, without reproducing the full text, for copyright reasons. The information available for each annotation is: Doc (the original document), Target (the generic concept covering the annotated item), Original_Value (the annotated item), Attached_Value (an annotated secondary item for disambiguation), Type (the category of the annotated entity: symbolic, quantitative or additimentionnal) and Annotator (the annotator that performed the annotation). The code of the project for which this Gold Standard was designed is available here: https://github.com/Eskode/ARTEXT4LOD
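    Loading one annotator table and grouping its rows by source document can be sketched as below; the sample rows are invented for illustration and do not reproduce actual annotations.

```python
import csv
from collections import defaultdict

def annotations_by_doc(tsv_text):
    """Group rows of an annotator TSV (columns as described above) by the Doc field."""
    by_doc = defaultdict(list)
    for row in csv.DictReader(tsv_text.splitlines(), delimiter="\t"):
        by_doc[row["Doc"]].append(row)
    return by_doc

# Invented sample rows following the described column layout.
sample = (
    "Doc\tTarget\tOriginal_Value\tAttached_Value\tType\tAnnotator\n"
    "paper01\tPermeability\t1.2e-18\tmol m / (m2 s Pa)\tquantitative\tAnnotator1\n"
    "paper01\tPackaging\tfilm\t\tsymbolic\tAnnotator2\n"
)
groups = annotations_by_doc(sample)
```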

  20. CDC Text Corpora for Learners: HTML Mirrors of MMWR, EID, and PCD

    • data.cdc.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Mar 20, 2024
    + more versions
    Cite
    NCSTLTPHIW(PHIC) (2024). CDC Text Corpora for Learners: HTML Mirrors of MMWR, EID, and PCD [Dataset]. https://data.cdc.gov/National-Center-for-State-Tribal-Local-and-Territo/CDC-Text-Corpora-for-Learners-HTML-Mirrors-of-MMWR/ut5n-bmc3
    Explore at:
    csv, xml, xlsxAvailable download formats
    Dataset updated
    Mar 20, 2024
    Dataset authored and provided by
    NCSTLTPHIW(PHIC)
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The attached ZIP archives are part of the CDC Text Corpora for Learners program. This version, comprising 33,567 articles, was constructed on 2024-03-01 using source content retrieved on 2024-01-09.

    The attached three ZIP archives contain the 33,567 articles in 33,576 compiled HTML mirrors of the MMWR (Morbidity and Mortality Weekly Report), including its series Weekly Reports, Recommendations and Reports, Surveillance Summaries, Supplements, and Notifiable Diseases (a subset of Weekly Reports, constructed ad hoc); EID (Emerging Infectious Diseases); and PCD (Preventing Chronic Disease). There is one archive per series. The archive attachments are located in the About this Dataset section of this landing page. In that section, when you click Show More, the attachments are located in the section Attachments.

    The retrieval and organization of the files included making as few changes to raw sources as possible, to support as many downstream uses as possible.

Testing

These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

Validation

The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

Training

We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

Code

We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.

We used a java script provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
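A standard way to score text detection against ground truth is intersection-over-union (IoU) of predicted and labeled boxes. The sketch below is a generic illustration of that idea, not the metric implemented by evaluator.jar.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```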
