License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available in a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on detected text rows. It is included in our code archive (code.tar) as text_recognition_multipro.py.
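As a rough stand-in for that script (this is not the shipped text_recognition_multipro.py; the image path and page segmentation mode are illustrative assumptions), a minimal sketch of passing one detected text row to Tesseract via pytesseract could look like this:

```python
# Minimal sketch (not the authors' script): OCR one cropped text-row image
# with Tesseract via pytesseract. Assumes Tesseract is installed locally.
from PIL import Image
import pytesseract

def recognize_row(row_image_path: str) -> str:
    """Run Tesseract on a single cropped text row.

    --psm 7 tells Tesseract to treat the image as a single text line.
    """
    image = Image.open(row_image_path)
    return pytesseract.image_to_string(image, config="--psm 7").strip()

if __name__ == "__main__":
    # Hypothetical path to a cropped row produced by the detection network.
    print(recognize_row("detected_rows/row_000.png"))
```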
We used a Java evaluation tool provided by Falk Böschen and adapted it to our file structure. We included it as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was collected from social media such as Facebook and Telegram and then further processed. Three collections are provided: original_cleaned, which is neither stemmed nor stop-word filtered; stopword_removed, in which stop words are removed but words are not stemmed; and stemmed, in which words are stemmed and stop words are removed. Stemming was done using HornMorpho, developed by Michael Gasser (available at https://github.com/hltdi/HornMorpho). All datasets are normalized and free from noise such as punctuation marks and emojis.
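Purely as an illustration of the kind of normalization described above (this is not the dataset's actual pipeline; the Unicode ranges and the Amharic-like example are assumptions), a sketch of stripping punctuation and emojis might look like this:

```python
# Minimal sketch (not the authors' pipeline): strip punctuation marks and
# emojis from a social-media post before stemming / stop-word removal.
import re

# Unicode ranges below are an illustrative, incomplete emoji selection.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002700-\U000027BF\U0001F1E6-\U0001F1FF]",
    flags=re.UNICODE,
)
PUNCT_PATTERN = re.compile(r"[!\"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~«»።፣፤፥]")

def normalize(text: str) -> str:
    text = EMOJI_PATTERN.sub(" ", text)   # drop emojis
    text = PUNCT_PATTERN.sub(" ", text)   # drop punctuation (incl. Ethiopic marks)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("ሰላም! 👋 እንዴት ነህ?"))  # -> "ሰላም እንዴት ነህ"
```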
The Softcite dataset is a gold-standard dataset of software mentions in research publications, a free resource primarily for software entity recognition in scholarly text. This is the first release of this dataset.
What's in the dataset
With the aim of facilitating software entity recognition efforts at scale and eventually increased visibility of research software for the due credit of software contributions to scholarly research, a team of trained annotators from Howison Lab at the University of Texas at Austin annotated 4,093 software mentions in 4,971 open access research publications in biomedicine (from PubMed Central Open Access collection) and economics (from Unpaywall open access services). The annotated software mentions, along with their publisher, version, and access URL, if mentioned in the text, as well as those publications annotated as containing no software mentions, are all included in the released dataset as a TEI/XML corpus file.
For understanding the schema of the Softcite corpus, its design considerations, and provenance, please refer to our paper included in this release (preprint version).
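As a rough illustration of inspecting the corpus file (the assumption that mentions are encoded as TEI <rs> elements with a type attribute is not confirmed here and should be checked against the schema described in the paper), a sketch with Python's standard library could be:

```python
# Sketch: tally annotated spans in the Softcite TEI/XML corpus.
# Assumes (unverified) that annotations are TEI <rs> elements with a "type"
# attribute; check the corpus schema before relying on this.
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def count_annotation_types(path: str = "softcite_corpus-full.tei.xml") -> Counter:
    counts = Counter()
    # iterparse keeps memory bounded for a large corpus file
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == f"{TEI_NS}rs":
            counts[elem.get("type", "unknown")] += 1
            elem.clear()
    return counts

if __name__ == "__main__":
    for ann_type, n in count_annotation_types().most_common():
        print(ann_type, n)
```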
Use scenarios
The release of the Softcite dataset is intended to encourage researchers and stakeholders to make research software more visible in science, especially to academic databases and systems of information retrieval; and facilitate interoperability and collaboration among similar and relevant efforts in software entity recognition and building utilities for software information retrieval. This dataset can also be useful for researchers investigating software use in academic research.
Current release content
softcite-dataset v1.0 release includes:
The Softcite dataset corpus file: softcite_corpus-full.tei.xml
Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications, our paper that describes the design consideration and creation process of the dataset: Softcite_Dataset_Description_RC.pdf. (This is a preprint version of our forthcoming publication in the Journal of the Association for Information Science and Technology.)
The Softcite dataset is licensed under a Creative Commons Attribution 4.0 International License.
If you have questions, please start a discussion or issue in the howisonlab/softcite-dataset GitHub repository.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is an extension of a publicly available dataset originally published by Ferenc et al. in their paper: "Ferenc, R.; Hegedus, P.; Gyimesi, P.; Antal, G.; Bán, D.; Gyimóthy, T. Challenging machine learning algorithms in predicting vulnerable javascript functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). IEEE, 2019, pp. 8–14." That dataset contained software metrics for source code functions written in the JavaScript (JS) programming language. Each function was labeled as vulnerable or clean. The authors gathered vulnerabilities from publicly available vulnerability databases.

In our paper entitled "Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction", cited as "Kalouptsoglou I, Siavvas M, Kehagias D, Chatzigeorgiou A, Ampatzoglou A. Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction. Entropy. 2022; 24(5):651. https://doi.org/10.3390/e24050651", we presented an extended version of the dataset by extracting textual features for the labeled JS functions. In particular, we took the dataset provided by Ferenc et al. in CSV format and gathered all the GitHub URLs of the dataset's functions (i.e., methods). Using these URLs, we collected the source code of the corresponding JS files from GitHub. Subsequently, using the start and end line information for every function, we cut out the code of the functions. Each function was then tokenized to construct a list of tokens per function. To extract text features, we used a text mining technique called sequences of tokens. As a result, we created a repository with all methods' source code, the token sequences of each method, and their labels. To boost the generalizability of type-specific tokens, all comments were eliminated, while all integers and strings were replaced with two unique IDs.

The dataset contains 12,106 JavaScript functions, of which 1,493 are considered vulnerable. It was created and utilized during the Vulnerability Prediction Task of the Horizon 2020 IoTAC Project as training and evaluation data for the construction of vulnerability prediction models.

The dataset is provided in CSV format. Each row of the CSV file has the following parts:
- Label: flag with value '1' for vulnerable and '0' for non-vulnerable methods
- Name: the name of the JavaScript method
- Longname: the longname of the JavaScript method
- Path: the path of the method's file in the repository
- Full_repo_path: the GitHub URL of the method's file
- TokenX: each subsequent field corresponds to a token included in the method
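As an illustration of reading rows with a variable number of token fields (the file name below is hypothetical and the presence of a header row is an assumption), a minimal sketch could be:

```python
# Sketch: read the extended CSV and separate labels from token sequences.
# The file name is hypothetical; the column layout follows the description
# above (Label, Name, Longname, Path, Full_repo_path, then one token per field).
import csv

labels, token_sequences = [], []
with open("js_vulnerability_token_sequences.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) < 5:
            continue  # skip malformed or empty lines
        label, name, longname, path, full_repo_path, *tokens = row
        if not label.isdigit():
            continue  # skip a possible header row
        labels.append(int(label))
        token_sequences.append(tokens)

print(f"{len(labels)} methods, {sum(labels)} labeled vulnerable")
print(token_sequences[0][:10])  # first tokens of the first method
```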
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the master's thesis "LLMs for Code Comment Consistency." Covers the languages Go, Java, JavaScript, TypeScript, and Python. All data is mined from permissively-licensed GitHub public projects with at least 25 stars and 25 pull requests submitted at the time of access, based on the GitHub Public Repository Metadata Dataset.
This dataset pertains specifically to pull request comments that are made on files. In other words, every comment in this dataset is linked to a specific file in a pull request.
Anything you want, of course, but here are some starter ideas:
- Sentiment analysis of comments: is there a correlation between number of contributions and positivity of reviews?
- Pull request comment generation: can we automatically make code review comments?
- PR text mining: can we mine out examples of a specific type of comment? (In my project, this was comments about function documentation.)
The mining code is publicly accessible at https://github.com/pelmers/llms-for-code-comment-consistency/tree/main/rq3
Each file is a JSON object where each key is a GitHub repository and each value is a list of pull request comments in that repository.
Example:
{
"trekhleb/javascript-algorithms": [{
"html_url": "https://github.com/trekhleb/javascript-algorithms/pull/101#discussion_r204437121",
"path": "src/algorithms/string/knuth-morris-pratt/knuthMorrisPratt.js",
"line": 33,
"body": "Please take a look at the comments to the tests above. No need to do this checking.",
"user": "trekhleb",
"diff_hunk": "@@ -30,6 +30,10 @@ function buildPatternTable(word) {
* @return {number}
*/
export default function knuthMorrisPratt(text, word) {
+ if (word.length === 0) {",
"author_association": "OWNER",
"commit_id": "618d0962025ff1116979560a0bfa0ed1660f129e",
"id": 204437121,
"repo": "trekhleb/javascript-algorithms"
}, ...]
}
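For orientation, a minimal sketch of iterating over such a file (the file name is hypothetical; the structure follows the example above) might be:

```python
# Sketch: iterate over the comments in one dataset file.
# Structure assumed: { "owner/repo": [ {comment fields...}, ... ], ... }
import json

with open("pr_file_comments.json", encoding="utf-8") as f:
    data = json.load(f)

for repo, comments in data.items():
    for comment in comments:
        print(repo, comment["path"], comment["line"], comment["body"][:60])
```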
The whole data and source can be found at https://emilhvitfeldt.github.io/friends/
"The goal of friends to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."
- friends.csv - Contains the scenes and lines for each character, including season and episodes.
- friends_emotions.csv - Contains sentiments for each scene, for the first four seasons only.
- friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.
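As a small sketch of combining the line-level and episode-level tables (column names such as season, episode, and imdb_rating are assumptions based on the file descriptions above; check the CSV headers first):

```python
# Sketch: lines per episode joined with episode metadata. Column names are
# assumed from the descriptions above, not verified against the files.
import pandas as pd

lines = pd.read_csv("friends.csv")
info = pd.read_csv("friends_info.csv")

per_episode = (
    lines.groupby(["season", "episode"]).size().rename("n_lines").reset_index()
         .merge(info, on=["season", "episode"], how="left")
)
print(per_episode[["season", "episode", "n_lines", "imdb_rating"]].head())
```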
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The city of Austin has administered a community survey for the 2015, 2016, 2017, 2018, and 2019 years (https://data.austintexas.gov/City-Government/Community-Survey/s2py-ceb7) to "assess satisfaction with the delivery of the major City Services and to help determine priorities for the community as part of the City's ongoing planning process." To directly access this dataset from the city of Austin's website, you can follow this link: https://cutt.ly/VNqq5Kd. Although we downloaded the dataset analyzed in this study from the former link, given that the city of Austin is interested in continuing to administer this survey, there is a chance that the data we used for this analysis and the data hosted on the city of Austin's website may differ in the following years. Accordingly, to ensure the replication of our findings, we recommend that researchers download and analyze the dataset we employed in our analyses, which can be accessed at the following link: https://github.com/democratizing-data-science/MDCOR/blob/main/Community_Survey.csv.

Replication Features or Variables

The community survey data has 10,684 rows and 251 columns. Of these columns, our analyses rely on the following three indicators, taken verbatim from the survey: "ID", "Q25 - If there was one thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?", and "Do you own or rent your home?"
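As a hedged sketch of pulling the three indicators named above (the raw-file URL is derived from the repository link, and the exact column headers are assumed to match the survey wording; verify both before use):

```python
# Sketch: load the replication CSV and keep the three indicators listed above.
# Check df.columns first; the header strings below are assumptions.
import pandas as pd

URL = ("https://raw.githubusercontent.com/democratizing-data-science/"
       "MDCOR/main/Community_Survey.csv")

df = pd.read_csv(URL)
cols = [
    "ID",
    "Q25 - If there was one thing you could share with the Mayor regarding "
    "the City of Austin (any comment, suggestion, etc.), what would it be?",
    "Do you own or rent your home?",
]
subset = df[[c for c in cols if c in df.columns]]
print(df.shape)        # expected (10684, 251) per the description above
print(subset.head())
```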
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over 13 MILLION pull request comments
Dataset used for the master's thesis "LLMs for Code Comment Consistency." Covers the languages Go, Java, JavaScript, TypeScript, and Python. All data is mined from permissively-licensed GitHub public projects with at least 25 stars and 25 pull requests submitted at the time of access.
This dataset pertains specifically to **pull request comments that are made on files.** In other words, every comment in this dataset is linked to a specific file in a pull request.
### What can I do with this data?
Anything you want, of course, but here are some starter ideas:
- Sentiment analysis of comments, is there a correlation between number of contributions and positivity of reviews?
- Pull request comment generation: can we automatically make code review comments?
- PR text mining: can we mine out examples of a specific type of comment? (in my project, this was comments about function documentation)
The mining code is publicly accessible.
Each file is a JSON object where each key is a GitHub repository and each value is a list of pull request comments in that repository.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first multi-modal Steam dataset with semantic search capabilities. 239,664 applications collected from official Steam Web APIs with PostgreSQL database architecture, vector embeddings for content discovery, and comprehensive review analytics.
Made by a lifelong gamer for the gamer in all of us. Enjoy!🎮
GitHub Repository https://github.com/vintagedon/steam-dataset-2025
1024-dimensional game embeddings projected to 2D via UMAP reveal natural genre clustering in semantic space
Unlike traditional flat-file Steam datasets, this is built as an analytically-native database optimized for advanced data science workflows:
☑️ Semantic Search Ready - 1024-dimensional BGE-M3 embeddings enable content-based game discovery beyond keyword matching
☑️ Multi-Modal Architecture - PostgreSQL + JSONB + pgvector in unified database structure
☑️ Production Scale - 239K applications vs typical 6K-27K in existing datasets
☑️ Complete Review Corpus - 1,048,148 user reviews with sentiment and metadata
☑️ 28-Year Coverage - Platform evolution from 1997-2025
☑️ Publisher Networks - Developer and publisher relationship data for graph analysis
☑️ Complete Methodology & Infrastructure - Full work logs document every technical decision and challenge encountered, while my API collection scripts, database schemas, and processing pipelines enable you to update the dataset, fork it for customized analysis, learn from real-world data engineering workflows, or critique and improve the methodology
Market segmentation and pricing strategy analysis across top 10 genres
Core Data (CSV Exports):
- 239,664 Steam applications with complete metadata
- 1,048,148 user reviews with scores and statistics
- 13 normalized relational tables for pandas/SQL workflows
- Genre classifications, pricing history, platform support
- Hardware requirements (min/recommended specs)
- Developer and publisher portfolios
Advanced Features (PostgreSQL):
- Full database dump with optimized indexes
- JSONB storage preserving complete API responses
- Materialized columns for sub-second query performance
- Vector embeddings table (pgvector-ready; see the query sketch below)
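As a hedged illustration of semantic search against the pgvector embeddings table (table and column names, database name, and connection settings below are assumptions, not the dataset's actual schema; adapt them to the shipped schema documentation):

```python
# Sketch: nearest-neighbour search over 1024-d BGE-M3 embeddings with pgvector.
# Table/column names and connection settings are assumptions.
import psycopg2

QUERY = """
    SELECT app_id, embedding <=> %s::vector AS cosine_distance
    FROM game_embeddings          -- hypothetical table name
    ORDER BY embedding <=> %s::vector
    LIMIT 10;
"""

def similar_games(query_embedding: list[float]):
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    with psycopg2.connect(dbname="steam_dataset") as conn:  # assumed DB name
        with conn.cursor() as cur:
            cur.execute(QUERY, (vec, vec))
            return cur.fetchall()

# `<=>` is pgvector's cosine-distance operator; smaller means more similar.
```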
Documentation:
- Complete data dictionary with field specifications
- Database schema documentation
- Collection methodology and validation reports
Three comprehensive analysis notebooks demonstrate dataset capabilities. All notebooks render directly on GitHub with full visualizations and output:
- 28 years of Steam's growth, genre evolution, and pricing strategies. (View on GitHub | PDF Export)
- Content-based recommendations using vector embeddings across genre boundaries. (View on GitHub | PDF Export)
- Genre prediction from game descriptions; demonstrates text analysis capabilities. (View on GitHub | PDF Export)
Notebooks render with full output on GitHub. Kaggle-native versions planned for v1.1 release. CSV data exports included in dataset for immediate analysis.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". You can also find these materials in the GitHub repo (https://github.com/wesslen/text-analysis-org-science), as well as the Shiny app in its own GitHub repo (https://github.com/wesslen/topicApp).
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
1,018,735 nanopublications. These nanopubs were automatically extracted from the DisGeNET dataset. See also the main DisGeNET data on Datahub at https://datahub.io/dataset/disgenet.
Download the content of this set of nanopublications from the server network using nanopub-java at https://github.com/Nanopublication/nanopub-java:
$ np get -c -o nanopubs.trig RAVEKRW0m6Ly_PjmhcxCZMR5fYIlzzqjOWt1CgcwD_77c
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A biomedical dataset supporting ontology enrichment from texts, by concept discovery and placement, adapting the MedMentions dataset (PubMed abstracts) with SNOMED CT versions from 2014 and 2017 under the Diseases (disorder) sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product (CPP).
The dataset is documented in the work, Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement, on arXiv: https://arxiv.org/abs/2306.14704 (CIKM 2023). The companion code is available at https://github.com/KRR-Oxford/OET.
Out-of-KB mention discovery (including the settings of mention-level data) is further partly documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).
ver4: we made a version of mention-level data for out-of-KB discovery and concept placement separately: the former (for out-of-KB discovery) has out-of-KB mentions in training data, while the latter (for concept placement) has only out-of-KB mentions during the evaluation (validation and test) and not in the training data. Also, we split the original "test-NIL.jsonl" (now "test-NIL-all.jsonl") into "valid-NIL.jsonl" and "test-NIL.jsonl" for a better evaluation.
ver3: we revised and updated mention-level data (syn_full, synonym augmentation setting) and the folder structure, and also updated the edge catalogues with complex edges.
ver2: we revised the mention-level data by only keeping out-of-KB mentions (or "NIL" mentions) associated with one-hop edges (including leaf nodes) and two-hop edges in the ontology (SNOMED CT 20140901).
Acknowledgement of data sources and tools below:
- SNOMED CT https://www.nlm.nih.gov/healthit/snomedct/archive.html (and use snomed-owl-toolkit to form .owl files)
- UMLS https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html (and mainly use MRCONSO for mapping UMLS to SNOMED CT)
- MedMentions https://github.com/chanzuckerberg/MedMentions (source of entity linking)
- Protégé http://protegeproject.github.io/protege/
- snomed-owl-toolkit https://github.com/IHTSDO/snomed-owl-toolkit
- DeepOnto https://github.com/KRR-Oxford/DeepOnto (based on OWLAPI https://owlapi.sourceforge.net/) for ontology processing and complex concept verbalisation
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results from the ODDPub text mining algorithm and the findings from manual analysis. Full-text PDFs of all articles parallel-published by Linköping University in 2022 were extracted from the institute's repository, DiVA. These were analyzed using the ODDPub (https://github.com/quest-bih/oddpub) text mining algorithm to determine the extent of data sharing and identify the repositories where the data was shared. In addition to the results from ODDPub, manual analysis was conducted to confirm the presence of data sharing statements, assess data availability, and identify the repositories used.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request of the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account to apply our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-process the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters by space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid considering the same words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united as one word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most of the prefixes were extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words have been substituted to "ztest", "wellknown" and "chisquare". Identification of such words is done by sampling of abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining characters "-" are replaced by space.
6. Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formulae might be important for our analysis. Some examples are "co2", "h2o" and "21st".
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
Word: Contains the unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents that contain the word, in descending order.
Number of Documents Containing the Word: For this field, a binary calculation is used: if a word exists in an abstract, there is a count of 1. If the word exists more than once in a document, the count is still 1. The total number of documents containing the word is counted as the sum of 1s in the entire corpus.
Number of Appearances in Corpus: Contains how many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: Includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from LSC as defined in the previous section.
The code can be used by:
1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
2. Open the LScD_Creation.R script
3. Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," accessible online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
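Purely as an illustration of the pre-processing in Step 4 above (the authors' actual implementation is the R script in [2]; this rough Python approximation omits prefix uniting and word substitution, and its stemmer and stop-word list differ from the 'tm' package):

```python
# Rough Python approximation of the pre-processing described above; the
# authors' actual implementation is the R script LScD_Creation.R [2].
import re
from collections import Counter
from nltk.stem import PorterStemmer   # stemmer choice differs from 'tm' in R

STOP_WORDS = {"i", "the", "a", "an", "and", "of", "in", "to", "is", "are"}  # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(abstract: str) -> list[str]:
    text = re.sub(r"[^a-zA-Z0-9\- ]", " ", abstract)   # step 1: keep "-" for now
    text = text.lower()                                # step 2: lowercase
    text = text.replace("-", " ")                      # step 5: drop remaining "-"
    text = re.sub(r"\b\d+\b", " ", text)               # step 6: standalone numbers
    tokens = [stemmer.stem(t) for t in text.split()]   # step 7: stemming
    return [t for t in tokens if t not in STOP_WORDS]  # step 8: stop words

# Dictionary counts: documents containing a word vs. total appearances.
doc_freq, total_freq = Counter(), Counter()
for abstract in ["The z-score is computed twice, twice.", "CO2 levels in 2014."]:
    tokens = preprocess(abstract)
    total_freq.update(tokens)
    doc_freq.update(set(tokens))   # binary count per document
print(doc_freq.most_common(5))
```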
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
"id": "ont_k_music_test_n",
"sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
"triples": [
{
"sub": "The Loco-Motion",
"rel": "publication date",
"obj": "01 January 1962"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Gerry Goffin"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Carole King"
}
]
}
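To make the task format concrete, here is a small hedged sketch that loads gold examples from JSONL and computes a naive exact-match triple score; the file names and the scoring logic are illustrative simplifications, not the benchmark's official evaluation scripts (those live in the evaluation folder of the repo):

```python
# Sketch: load gold triples and naively score predictions by exact match.
# File names are illustrative; the official metrics are more elaborate.
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def triple_set(example):
    return {(t["sub"], t["rel"], t["obj"]) for t in example.get("triples", [])}

gold = {ex["id"]: ex for ex in load_jsonl("ont_music_gold.jsonl")}         # assumed name
pred = {ex["id"]: ex for ex in load_jsonl("ont_music_predictions.jsonl")}  # assumed name

matched = sum(len(triple_set(gold[i]) & triple_set(pred.get(i, {}))) for i in gold)
total = sum(len(triple_set(ex)) for ex in gold.values())
print(f"exact-match recall: {matched / max(total, 1):.3f}")
```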
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
The structure of the repo is as follows:
- benchmark: the code used to generate the benchmark
- evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
This is the first version of the English dataset for VecTop, which contains >250k articles (2018-10-01 to 2023-10-23) from the NY Times, embedded with OpenAI's text-embedding-ada-002. This corpus is used within VecTop to extract the topics and subtopics of a given text. Please refer to the GitHub page for more information and to the live demo for a quick evaluation.
This dataset is also supplied as a PostgreSQL backup. It is advisable to import the dataset into a proper database with vector functionalities for instant results. See the GitHub repo for that.
A German version with Spiegel Online has already been released here.
Given a small or large chunk of text, it is useful to categorize the text into topics. VecTop uses this dataset within a PostgreSQL database to first summarize the unlabeled text (if it is determined to be too long) and then create embeddings of it. These embeddings are compared to the dataset, and VecTop determines the topics and subtopics by looking at the topics and subtopics of the closest embeddings with respect to cosine similarity. As a result, the text is categorized into topics and subtopics.
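As a rough sketch of that nearest-neighbour step (this is not VecTop's actual implementation; variable names, the in-memory setup, and the top-k choice are assumptions):

```python
# Sketch of the nearest-neighbour topic lookup described above (not VecTop's
# actual code): compare a query embedding to stored article embeddings by
# cosine similarity and collect the topics of the closest matches.
import numpy as np

def top_topics(query_vec, article_vecs, article_topics, k=5):
    """query_vec: (d,), article_vecs: (n, d), article_topics: list of n labels."""
    a = article_vecs / np.linalg.norm(article_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = a @ q                              # cosine similarity per article
    nearest = np.argsort(sims)[::-1][:k]      # indices of the k closest articles
    return [article_topics[i] for i in nearest]

# Toy usage with random 1536-d vectors (ada-002's embedding size).
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 1536))
topics = [f"topic_{i % 7}" for i in range(100)]
print(top_topics(rng.normal(size=1536), vecs, topics))
```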
The dataset can be used to search for similarities in texts.
Legal VecTop will be used to research legal activities. For that, a legal corpus is being built. (Coming soon)
VecTop, and therefore this dataset, is licensed under the Apache-2.0 license.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the benchmark data used for benchmarking text extraction tools. The data contains:
- List of documents
- Ground truth data for each document
- Additional metadata extracted with the FITS tool

The data also contains benchmark results of the following tools:
- Apache Tika v1.1
- Apache Tika v1.2
- Apache Tika v1.13
- DocToText
- XPdf

The source code of the tools used to produce the dataset and the benchmark results can be found here:
- https://github.com/kduretec/DataGeneratorAnalysis - R scripts for producing final results
- https://github.com/kduretec/TestDataGenerator - Data and ground truth generator
- https://github.com/kduretec/ToolEvaluator - Part which evaluates software components and produces results for the R scripts
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The POLIANNA dataset is a collection of legislative texts from the European Union (EU) that have been annotated based on theoretical concepts of policy design. The dataset consists of 20,577 annotated spans in 412 articles, drawn from 18 EU climate change mitigation and renewable energy laws, and can be used to develop supervised machine learning approaches for scaling policy analysis. The dataset includes a novel coding scheme for annotating text spans; a description of the annotated corpus, an analysis of inter-annotator agreement, and a discussion of potential applications can be found in the paper accompanying this dataset. The objective of this dataset is to help build tools that assist with manual coding of policy texts by automatically identifying relevant paragraphs.
Detailed instructions and further guidance about the dataset as well as all the code used for this project can be found in the accompanying paper and on the GitHub project page. The repository also contains useful code to calculate various inter-annotator agreement measures and can be used to process text annotations generated by INCEpTION.
Dataset Description
We provide the dataset in 3 different formats:
JSON: Each article corresponds to a folder, where the Tokens and Spans are stored in separate JSON files. Each article folder further contains the raw policy text as a text file and the metadata about the policy. This is the most human-readable format (see the loading sketch after this list).
JSONL: Same folder structure as the JSON format, but the Spans and Tokens are stored in a JSONL file, where each line is a valid JSON document.
Pickle: We provide the dataset as a Python object. This is the recommended method when using our own Python framework that is provided on GitHub. For more information, check out the GitHub project page.
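As an illustration of the JSON layout described above, the sketch below loads one article folder; the file names inside each folder and the example path are assumptions, so check them against the data dictionary on the project page.

```python
# Sketch: load one article folder from the JSON variant of POLIANNA.
# File names inside the folder are assumed; adjust them to the released data.
import json
from pathlib import Path

def load_article(folder: str):
    folder = Path(folder)
    with open(folder / "tokens.json", encoding="utf-8") as f:   # assumed name
        tokens = json.load(f)
    with open(folder / "spans.json", encoding="utf-8") as f:    # assumed name
        spans = json.load(f)
    return tokens, spans

tokens, spans = load_article("polianna_json/article_0001")       # hypothetical path
print(f"{len(tokens)} tokens, {len(spans)} annotated spans")
```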
License
The POLIANNA dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. If you use the POLIANNA dataset in your research in any form, please cite the dataset.
Citation
Sewerin, S., Kaack, L.H., Küttel, J. et al. Towards understanding policy design through text-as-data approaches: The policy design annotations (POLIANNA) dataset. Sci Data 10, 896 (2023). https://doi.org/10.1038/s41597-023-02801-z
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset presents a Gold Standard of data annotated on documents from the Science Direct website. The entities annotated are the ones related to permeability n-ary relations, as defined in the TRANSMAT Ontology (https://ico.iate.inra.fr/atWeb/, https://doi.org/10.15454/NK24ID, http://agroportal.lirmm.fr/ontologies/TRANSMAT) and following the annotation guide also available here. The annotations were performed by three annotators on a WebAnno (doi: 10.3115/v1/P14-5016) server. The four files present (one per annotator, plus a merged version with priority given to annotator 1 in case of conflicts on annotated items) were obtained from the output files of the WebAnno tool. They are presented in table format, without reproducing the full text, for copyright purposes.

The information available on each annotation is:
- Doc: the original document
- Target: the generic concept covering the annotated item
- Original_Value: the annotated item
- Attached_Value: an annotated secondary item for disambiguation
- Type: the category of the annotated entity (symbolic, quantitative or additimentionnal)
- Annotator: the annotator that performed the annotation

The code of the project for which this Gold Standard was designed is available here: https://github.com/Eskode/ARTEXT4LOD
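As a small hedged sketch of summarizing the annotation tables described above (the files are described only as "table format"; CSV with the listed column names and the file name below are assumptions):

```python
# Sketch: a simple per-annotator / per-target tally over an annotation table.
# CSV format and the columns (Doc, Target, Original_Value, Attached_Value,
# Type, Annotator) are assumed from the description above.
import pandas as pd

df = pd.read_csv("merged_annotations.csv")   # hypothetical file name

# How many annotations each annotator produced, per generic concept (Target).
summary = (
    df.groupby(["Annotator", "Target"]).size().rename("n_annotations").reset_index()
)
print(summary.sort_values("n_annotations", ascending=False).head(10))
```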
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The attached ZIP archives are part of the CDC Text Corpora for Learners program. This version, comprised of 33,567 articles, was constructed on 2024-03-01 using source content retrieved on 2024-01-09.
The attached three ZIP archives contain the 33,567 articles in 33,576 compiled HTML mirrors of the MMWR (Morbidity and Mortality Weekly Report), including its series Weekly Reports, Recommendations and Reports, Surveillance Summaries, Supplements, and Notifiable Diseases (a subset of Weekly Reports, constructed ad hoc); EID (Emerging Infectious Diseases); and PCD (Preventing Chronic Disease). There is one archive per series. The archive attachments are located in the About this Dataset section of this landing page; in that section, when you click Show More, the attachments are located under Attachments.
The retrieval and organization of the files included making as few changes to raw sources as possible, to support as many downstream uses as possible.