Tackling Hallucinations in Neural Chart Summarization
Introduction
The trained models used for the investigations and the state-of-the-art (SOTA) improvements are detailed in the paper: Tackling Hallucinations in Neural Chart Summarization. This repo contains the optimized input prompts and the summaries that remain after NLI filtering.
Abstract
Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the problem of… See the full description on the dataset page: https://huggingface.co/datasets/saadob12/chart-to-text.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Chart Text Detection is a dataset for object detection tasks - it contains Text annotations for 6,399 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data archive contains two files: one holds all valid comment text data used in the paper, 297,774 items in total; the other holds the data needed to draw the main graphs in the paper.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview
This dataset contains the encoder embeddings and LLM prediction results for the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.
Dataset Description
The dataset structure should be organized as follows:

    /dataset/
    │── [dataset_name]/
    │   │── processed_data.pt    # Contains labels and graph information
    │   │── [encoder]_x.pt       # Features extracted by different encoders
    │…

See the full description on the dataset page: https://huggingface.co/datasets/Graph-COM/Text-Attributed-Graphs.
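The layout above can be checked locally before loading anything. The following is a minimal sketch that only builds the two file paths named in the tree. The root folder, the dataset name `cora`, and the encoder name `sbert` are placeholders for illustration, not names confirmed by the dataset page.

```python
from pathlib import Path


def expected_files(root: str, dataset_name: str, encoder: str) -> dict:
    """Build the two file paths named in the layout for one dataset/encoder pair."""
    base = Path(root) / dataset_name
    return {
        "processed": base / "processed_data.pt",  # labels and graph information
        "features": base / f"{encoder}_x.pt",     # features from the chosen encoder
    }


# "cora" and "sbert" are hypothetical placeholder names.
paths = expected_files("dataset", "cora", "sbert")
missing = [name for name, path in paths.items() if not path.exists()]
print("missing files:", missing)
```

Files with a `.pt` extension are typically read with `torch.load`, but the exact object stored in each file should be checked against the full dataset description.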
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Most existing chart datasets are in English, and there are almost no open-source Chinese chart datasets, which limits research and applications involving Chinese charts. This dataset follows the construction method of the DVQA dataset to create a chart dataset for the Chinese-language setting. To keep the data realistic and practical, we drew on the website of the National Bureau of Statistics and selected 24 data label categories that are widely used in practice, totaling 262 specific labels. These label categories cover areas such as socio-economic, demographic, and industrial development. To further increase diversity, the dataset uses 10 different numerical dimensions, which provide a wide range of values and multiple value types, simulating the data distributions and changes found in real application scenarios.

The dataset covers several types of Chinese bar charts: conventional vertical and horizontal bar charts, plus more challenging stacked bar charts that test how methods perform on charts of different complexity. Each chart type also carries varied attribute labels, including but not limited to whether data labels are shown and whether the text is rotated by 45°, 90°, etc. These details bring the dataset closer to real-world application scenarios and place higher demands on data extraction methods.

In addition to the charts themselves, the dataset provides the corresponding data table and title text for each chart, which is crucial for understanding chart content and verifying the accuracy of extracted results. Chart images are generated with Matplotlib, a widely used data visualization library for Python, chosen for its rich features, flexible configuration options, and compatibility. Matplotlib allows precise control of every detail of a chart, from the drawing of data points and the annotation of axes to legends and titles, so that the generated images meet the research needs and remain readable. The dataset consists of 58,712 pairs of Chinese bar charts and corresponding data tables, divided into training, validation, and test sets in a 7:2:1 ratio.
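To make the 7:2:1 ratio concrete, here is a minimal sketch of a ratio-based split over item indices. The shuffling, the seed, and the rounding rule (remainder goes to the test set) are assumptions for illustration; the source does not say how the authors actually assigned pairs to each set.

```python
import random


def split_indices(n_items: int, ratios=(7, 2, 1), seed: int = 0):
    """Shuffle item indices and cut them into train/validation/test parts by ratio."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_items * ratios[0] // total
    n_val = n_items * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left goes to the test set
    return train, val, test


train, val, test = split_indices(58712)
print(len(train), len(val), len(test))
```

With this rounding rule the three parts always add up to the full dataset, and no index appears in more than one part.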
The data this week comes from Data.World by way of Sean Miller, Billboard.com and Spotify.
Billboard Top 100 - Wikipedia
The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.
Billboard Top 100 Article
Drake rewrites the record for the most entries ever on the Billboard Hot 100, as he lands his 208th career title on the latest list, dated March 21
billboard.csv

| variable | class | description |
|---|---|---|
| url | character | Billboard Chart URL |
| week_id | character | Week ID |
| week_position | double | Week position (1 to 100) |
| song | character | Song name |
| performer | character | Performer name |
| song_id | character | Song ID, combo of song/singer |
| instance | double | Instance (this is used to separate breaks on the chart for a given song. Example, an instance of 6 tells you that this is the sixth time this song has appeared on the chart) |
| previous_week_position | double | Previous week position |
| peak_position | double | Peak position as of that week |
| weeks_on_chart | double | Weeks on chart as of that week |
audio_features.csv

| variable | class | description |
|---|---|---|
| song_id | character | Song ID |
| performer | character | Performer name |
| song | character | Song |
| spotify_genre | character | Genre |
| spotify_track_id | character | Track ID |
| spotify_track_preview_url | character | Spotify URL |
| spotify_track_duration_ms | double | Duration in ms |
| spotify_track_explicit | logical | Is explicit |
| spotify_track_album | character | Album name |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that t... |
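
The numeric encodings in the `audio_features.csv` table can be turned into readable labels. The sketch below decodes `key` and `mode` using standard pitch-class numbering and buckets `speechiness` with the 0.33 and 0.66 thresholds from the table. The function names and the label strings are illustrative choices, not part of the dataset.

```python
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode) -> str:
    """Turn the numeric `key` and `mode` columns into a label such as 'D major'."""
    key = int(key)  # the column is stored as a double
    if key == -1:
        return "no key detected"
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


def speechiness_band(value: float) -> str:
    """Bucket `speechiness` using the 0.33 and 0.66 thresholds from the table."""
    if value > 0.66:
        return "probably all spoken word"
    if value >= 0.33:
        return "music and speech"
    return "music / non-speech"


print(describe_key(2.0, 1.0), "|", speechiness_band(0.5))
```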
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Text reported $12.76B in market capitalization in December 2025, based on the latest stock price and the number of outstanding shares. Data for Open Text | OTC - Market Capitalization, including historical data, tables, and charts, was last updated by Trading Economics in December 2025.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created by Sanjana Murthy
Released under CC BY-NC-SA 4.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Text reported $13.48B in assets for its fiscal quarter ending in September 2025. Data for Open Text | OTC - Assets, including historical data, tables, and charts, was last updated by Trading Economics in December 2025.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of words and number of tokens denote the number of words in the dataset before and after preprocessing respectively. Direct edges and indirect edges represent the number of direct and indirect synonym relationships between words in the text graph respectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
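The task requires extracted triples to comply with the ontology. A minimal sketch of one such check, relation conformance, is shown below. The `ALLOWED_RELATIONS` set is a hypothetical two-item subset made up for illustration; the real music ontology defines its own concepts, relations, and domain/range constraints, and this is not the benchmark's evaluation script.

```python
# Hypothetical subset of the music ontology's relations, for illustration only.
ALLOWED_RELATIONS = {"publication date", "lyrics by"}

record = {
    "id": "ont_k_music_test_n",
    "sent": '"The Loco-Motion" is a 1962 pop song written by '
            "American songwriters Gerry Goffin and Carole King.",
    "triples": [
        {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"},
    ],
}


def nonconforming(triples, allowed_relations):
    """Return the triples whose relation is not defined in the ontology."""
    return [t for t in triples if t["rel"] not in allowed_relations]


def subjects_not_in_sentence(triples, sentence):
    """Crude faithfulness check: subjects that never appear verbatim in the sentence."""
    return [t["sub"] for t in triples if t["sub"] not in sentence]


print(nonconforming(record["triples"], ALLOWED_RELATIONS))
print(subjects_not_in_sentence(record["triples"], record["sent"]))
```

Objects are deliberately not string-matched against the sentence, because normalized values such as "01 January 1962" need not appear verbatim in the text.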
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows.
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Text reported $2.84 in dividend yield for its fiscal quarter ending in September 2025. Data for Open Text | OTC - Dividend Yield, including historical data, tables, and charts, was last updated by Trading Economics in December 2025.
https://creativecommons.org/publicdomain/zero/1.0/
Contains a graph representation of the English dictionary, where each word is a node with an edge to every word that appears in its definition. The JSON file is of the form:
JSON
{word: [Each, word, in, its, definition]
... }
Use this dataset to explore the structure of natural language!
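As a starting point for exploring the graph, the sketch below inverts the word-to-definition mapping so that each word points to the entries whose definitions use it. The three-entry `definitions` dictionary is a made-up stand-in for the real JSON file, which would be loaded with `json.load` instead.

```python
from collections import defaultdict

# Tiny made-up stand-in for the JSON file: word -> words in its definition.
definitions = {
    "cat": ["small", "domesticated", "animal"],
    "dog": ["domesticated", "animal"],
    "animal": ["living", "organism"],
}


def reverse_edges(graph):
    """Map each word to the set of entries whose definition uses it."""
    used_in = defaultdict(set)
    for word, definition in graph.items():
        for token in definition:
            used_in[token].add(word)
    return used_in


used_in = reverse_edges(definitions)
print(sorted(used_in["animal"]))
```

The size of each set is the in-degree of that word, which gives a rough measure of how central a word is to the dictionary's vocabulary.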
https://creativecommons.org/publicdomain/zero/1.0/
EventNarrative: A large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation
Accepted at the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
Authors: Anthony Colas, Ali Sadeghian, Yue Wang, Daisy Wang (University of Florida)
A knowledge graph-to-text dataset built from publicly available open-world knowledge graphs. EventNarrative consists of approximately 230,000 graphs and their corresponding natural language text, 6 times larger than the current largest parallel dataset. It makes use of a rich ontology, all of the KG entities are linked to the text, and our manual annotations confirm high data quality.
If you find our dataset useful, please cite:
@inproceedings{colas2021eventnarrative,
title={EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation},
author={Colas, Anthony and Sadeghian, Ali and Wang, Yue and Wang, Daisy Zhe},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
year={2021}
}
fery1234/table-text-clip dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comments from participants about the Table Caption and Text Search with Table Display, paraphrased, after using all of the interface options and then revisiting the table search view.
Real-world data (RWD) in the medical field, such as electronic health records (EHRs) and medication orders, are receiving increasing attention from researchers and practitioners. While structured data have played a vital role thus far, unstructured data represented by text (e.g., discharge summaries) are not effectively utilized because of the difficulty in extracting medical information. We evaluated the information gained by supplementing structured data with clinical concepts extracted from unstructured text using natural language processing techniques. Using a machine learning-based pretrained named entity recognition tool, we extracted disease and medication names from real discharge summaries in a Japanese hospital and linked them to medical concepts using medical term dictionaries. By comparing the diseases and medications mentioned in the text with medical codes in tabular diagnosis records, we found that (1) the text data contained richer information on patient symptoms than tabular diagnosis records, whereas the medication-order table stored more injection data than the text, and (2) extractable information regarding specific diseases showed surprisingly small intersections among text, diagnosis records, and medication orders. Text data can thus be a useful supplement for RWD mining, which is further demonstrated by (3) our practical application system for drug safety evaluation, which exhaustively visualizes suspicious adverse drug effects caused by the simultaneous use of anticancer drug pairs. We conclude that proper use of textual information extraction can lead to better outcomes in medical RWD mining.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four tables and 23 figures of this paper. Table 1 shows the concept space comparison of existing taxonomies. Table 2 presents Hearst pattern examples. Table 3 shows the labeling guideline for conceptualization. Table 4 presents the precision of short text understanding. Figure 1 shows the framework overview. Figure 2 shows local taxonomy construction. Figure 3 shows horizontal merging. Figure 4 shows vertical merging: single sense alignment. Figure 5 shows vertical merging: multiple sense alignment. Figure 6 is a subgraph of the heterogeneous semantic network around "watch". Figure 7 is the compression procedure of the typed-term co-occurrence network. Figure 8 presents an example of short text understanding. Figure 9 presents examples of the Chain model and the Pairwise model. Figure 10 is a snapshot of the Probase browser. Figure 11 is a snapshot of single instance conceptualization. Figure 12 is a snapshot of context-aware single instance conceptualization. Figure 13 shows an example of short text conceptualization. Figure 14 is the framework of topic search. Figure 15 is a snapshot of the Web tables. Figure 16 shows a query recommendation snapshot. Figure 17 shows the correlation of CTR with ads relevance score. Figure 18 presents the distribution of concepts in Microsoft Concept Graph. Figure 19 shows the concept coverage of different taxonomies. Figure 20 shows the precision of extracted isA pairs on 40 concepts. Figure 21 shows the precision of isA pairs after each iteration. Figure 22 shows the number of discovered concepts and isA pairs after each iteration. Figure 23 shows the precision and nDCG comparison.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EduMKG is a multimodal educational knowledge graph dataset that covers natural sciences (biology, physics, and chemistry) in middle and high school education. It includes multimodal concepts covering text, images, videos, and audio, as well as knowledge points and exercises extracted from curriculum standards and MOOCs. EduMKG comprises 34,630 multimodal concepts and 403,400 triples, making it a vital resource for research in multimodal educational applications.
New Document Description
EduMKG.rdf : The RDF-formatted knowledge graph adheres to the IRI standards and has already been uploaded to this dataset.
Accessing the SPARQL Endpoint and Performing Queries: We provide an example for reference.
Access the Apache Jena Fuseki UI
Enter the username and password: user, userPassword.
Example usage:
Example 1
# Query the explanation of the concept “上臂骨骼肌” (upper-arm skeletal muscle)
PREFIX ex: <http://v1.edumkg.org/>
SELECT ?explanation
WHERE {
?concept a ex:Concept .
?concept ex:hasAnExplanation ?explanation .
FILTER(CONTAINS(STR(?concept), ENCODE_FOR_URI("上臂骨骼肌")))
}
Example 2
# Select 10 "subject-predicate-object" triples from the database and display them (LIMIT returns an arbitrary 10, not a random sample).
PREFIX ex: <http://v1.edumkg.org/>
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object
}
LIMIT 10
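The Example 1 query can be generated for any concept label. The sketch below builds the query text and the form-encoded body that a SPARQL 1.1 Protocol POST request would carry. It stops short of sending the request, because the endpoint URL is not given in the source; the function name `explanation_query` is an illustrative choice.

```python
from urllib.parse import urlencode

PREFIX = "PREFIX ex: <http://v1.edumkg.org/>"


def explanation_query(label: str) -> str:
    """Build the Example 1 query for an arbitrary concept label."""
    # Escape backslashes and double quotes so the label is a valid SPARQL string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{PREFIX}\n"
        "SELECT ?explanation\n"
        "WHERE {\n"
        "  ?concept a ex:Concept .\n"
        "  ?concept ex:hasAnExplanation ?explanation .\n"
        f'  FILTER(CONTAINS(STR(?concept), ENCODE_FOR_URI("{escaped}")))\n'
        "}"
    )


query = explanation_query("上臂骨骼肌")
# Form-encoded request body; the endpoint URL itself is not stated in the source.
body = urlencode({"query": query})
print(query)
```

The body would be sent with the content type `application/x-www-form-urlencoded`, together with the username and password mentioned above, once the Fuseki endpoint address is known.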
Zip Description
RawData: This folder comprises the publicly available raw data, including images derived from course materials, subtitles extracted from MOOC videos, audio files associated with specific concepts, and supplementary files generated through multimodal concept extraction and alignment.
EduMKG: This folder contains the entire EduMKG, including the triple files of EduMKG as well as the cross-indexing files between multimodal concepts, knowledge points, and images
Example: The examples of the Case Study for Multimodal Alignment Section in the paper.
Groundtruth: The groundtruth for Evaluation on Concept Extraction Section in the paper.
Document Description
Documents in EduMKG
{subject}Image.json: This is a cross-indexing file between images and concepts in the multimodal discipline. Each image has a unique ID and is associated with a varying number of concepts.
{subject}Knowledge.json: This is a cross-reference file for knowledge points. Each knowledge point has a unique ID, along with URLs for corresponding exercises and IDs of related knowledge points.
{subject}.json: This is a multimodal concept cross-indexing file involving text, images, videos, audio, and knowledge points. Each multimodal concept has a unique ID.
{subjectTriples}.json: This file contains all the triples of EduMKG.
Documents in Example
Each folder contains text, image, video, and audio knowledge of multimodal concepts. To avoid copyright issues, we only provide video URLs and the time intervals where the concepts appear.
Documents in Groundtruth
Each `groundtruth.txt` contains the concepts corresponding to the course with the same name as the folder. The contents were manually annotated by three PhD students and seven master's students, with cross-validation performed to ensure accuracy. The `groundtruth.txt` files are used in the "Evaluation on Concept Extraction" section of the paper to validate the effectiveness of the proposed concept extraction method.
Documents in RawData
Includes images corresponding to multimodal concepts, as well as intermediate files generated during the construction of the knowledge graph, facilitating the reproducibility of the EduMKG construction process.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Slovenia's exports of used or new rags and scrap twine of text material to Portugal were US$1.09 thousand in 2023, according to the United Nations COMTRADE database on international trade. Slovenia Exports of used or new rags, scrap twine of text material to Portugal - data, historical chart and statistics - was last updated in November 2025.