100+ datasets found

chart-to-text
huggingface.co
Updated Oct 28, 2024
Cite
Saad Obaid ul Islam (2024). chart-to-text [Dataset]. https://huggingface.co/datasets/saadob12/chart-to-text
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 28, 2024
Authors
Saad Obaid ul Islam
Description
Tackling Hallucinations in Neural Chart Summarization

Introduction

The trained models, the investigations, and the state-of-the-art (SOTA) improvements are detailed in the paper Tackling Hallucinations in Neural Chart Summarization. This repo contains the optimized input prompts and the summaries after NLI filtering.

Abstract

Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the problem of… See the full description on the dataset page: https://huggingface.co/datasets/saadob12/chart-to-text.
Chart Text Detection Dataset
universe.roboflow.com
zip
Updated Sep 26, 2024
Cite
minhngoncoding (2024). Chart Text Detection Dataset [Dataset]. https://universe.roboflow.com/minhngoncoding/chart-text-detection
Explore at:
Available download formats: zip
Dataset updated
Sep 26, 2024
Dataset authored and provided by
minhngoncoding
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Text Bounding Boxes
Description
Chart Text Detection

## Overview
Chart Text Detection is a dataset for object detection tasks. It contains Text annotations for 6,399 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Effective comment data and chart data
scidb.cn
Updated Apr 25, 2022
Cite
Li Shancheng (2022). Effective comment data and chart data [Dataset]. http://doi.org/10.57760/sciencedb.01715
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.01715
Dataset updated
Apr 25, 2022
Dataset provided by
Science Data Bank
Authors
Li Shancheng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains two files. One holds all valid comment text data used in the paper, 297,774 comments in total; the other holds the data required to draw the main graphs in the paper.
Text-Attributed-Graphs
huggingface.co
Updated Feb 19, 2025
Cite
Graph Computation and Machine Learning (GCOM) Group (2025). Text-Attributed-Graphs [Dataset]. https://huggingface.co/datasets/Graph-COM/Text-Attributed-Graphs
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 19, 2025
Dataset authored and provided by
Graph Computation and Machine Learning (GCOM) Group
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview

This dataset covers the encoder embeddings and LLM prediction results from the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.

Dataset Description

The dataset structure should be organized as follows:

/dataset/
│── [dataset_name]/
│   │── processed_data.pt    # Contains labels and graph information
│   │── [encoder]_x.pt       # Features extracted by different encoders
│…

See the full description on the dataset page: https://huggingface.co/datasets/Graph-COM/Text-Attributed-Graphs.
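As a minimal sketch of how this layout might be addressed in code, the helper below builds the two file paths for one dataset and one encoder. The function name `expected_files` and the example values `cora` and `sbert` are hypothetical placeholders, not names confirmed by the dataset page; only `processed_data.pt` and the `[encoder]_x.pt` pattern come from the description.

```python
from pathlib import PurePosixPath


def expected_files(root: str, dataset_name: str, encoder: str) -> dict:
    """Return the two paths the described layout implies for one dataset.

    `dataset_name` and `encoder` fill the [dataset_name] and [encoder]
    placeholders; the values passed in are the caller's assumption.
    """
    base = PurePosixPath(root) / dataset_name
    return {
        "processed": base / "processed_data.pt",  # labels and graph information
        "features": base / f"{encoder}_x.pt",     # encoder-specific features
    }


# Hypothetical example values, for illustration only.
paths = expected_files("/dataset", "cora", "sbert")
print(paths["processed"])
print(paths["features"])
```

Keeping the path logic in one place means a change to the directory layout only has to be made once.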
CBCD:A Chinese Bar Chart Dataset for Data Extraction
scidb.cn
Updated Nov 14, 2025
Cite
Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan (2025). CBCD:A Chinese Bar Chart Dataset for Data Extraction [Dataset]. http://doi.org/10.57760/sciencedb.j00240.00052
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.j00240.00052
Dataset updated
Nov 14, 2025
Dataset provided by
Science Data Bank
Authors
Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Most existing chart datasets are in English, and there are almost no open-source Chinese chart datasets, which limits research and applications involving Chinese charts. This dataset follows the construction method of the DVQA dataset to create a chart dataset focused on the Chinese-language setting. To keep the dataset authentic and practical, we first consulted the website of the National Bureau of Statistics and selected 24 widely used data label categories, totaling 262 specific labels. These label categories cover areas such as socio-economic, demographic, and industrial development. To further increase diversity, the dataset uses 10 different numerical dimensions, which provide a wide range of values and multiple value types and can simulate the data distributions and changes encountered in real application scenarios. The dataset includes several types of Chinese bar charts: conventional vertical and horizontal bar charts, as well as more challenging stacked bar charts that test a method's performance on charts of different complexity. Each chart type also carries diverse attribute labels, including but not limited to whether data labels are shown and whether the text is rotated by 45°, 90°, etc. These details bring the dataset closer to real-world application scenarios while placing higher demands on data extraction methods.
In addition to the charts themselves, the dataset provides the corresponding data table and title text for each chart, which is crucial for understanding chart content and verifying the accuracy of extracted results. The chart images are generated with Matplotlib, a popular and widely used data visualization library for Python. Matplotlib gives precise control over every detail of a chart, from the drawing of data points to the annotation of coordinate axes and from legends to titles, so the generated images meet the research needs and remain highly readable. The dataset consists of 58,712 pairs of Chinese bar charts and corresponding data tables, divided into training, validation, and testing sets in a 7:2:1 ratio.
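The 7:2:1 ratio does not divide 58,712 evenly, and the description does not state the exact per-split counts. The sketch below shows one way to turn the ratio into whole-number split sizes, assuming largest-remainder rounding; the rounding rule and the function name `split_sizes` are assumptions for illustration, not the authors' documented procedure.

```python
def split_sizes(total: int, ratio: tuple) -> list:
    """Split `total` items by integer `ratio` using largest-remainder rounding."""
    weight = sum(ratio)
    # Floor share for each part, plus the leftover numerator used for ranking.
    floors = [total * r // weight for r in ratio]
    remainders = [total * r % weight for r in ratio]
    leftover = total - sum(floors)
    # Hand the remaining items to the parts with the largest remainders
    # (the sort is stable, so ties go to the earlier part).
    order = sorted(range(len(ratio)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        floors[i] += 1
    return floors


train, val, test = split_sizes(58712, (7, 2, 1))
print(train, val, test)
```

Under this assumed rule the three sizes always sum to the full dataset, so no chart-table pair is dropped by rounding.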
Knowledge Graph Dataset
kaggle.com
zip
Updated Nov 16, 2025
Cite
lariseak (2025). Knowledge Graph Dataset [Dataset]. https://www.kaggle.com/datasets/lariseak/knowledge-graph-dataset
Explore at:
Available download formats: zip (2037713 bytes)
Dataset updated
Nov 16, 2025
Authors
lariseak
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
We provide a new, publicly available dataset of knowledge graphs extracted with both REBEL and Gemini for the Reuters-21578, BBC, AG News, and 20 Newsgroups benchmarks. This resource can be used to benchmark other graph-based and knowledge-aware classification methods.
Text And Diagram Finder.v02 Dataset
universe.roboflow.com
zip
Updated Jan 1, 2025
Cite
diagram detection set (2025). Text And Diagram Finder.v02 Dataset [Dataset]. https://universe.roboflow.com/diagram-detection-set/text-and-diagram-finder.v02
Explore at:
Available download formats: zip
Dataset updated
Jan 1, 2025
Dataset authored and provided by
diagram detection set
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Questions Bounding Boxes
Description
Text And Diagram Finder.v02

## Overview
Text And Diagram Finder.v02 is a dataset for object detection tasks. It contains Questions annotations for 557 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Main text figure data
catalog.data.gov
Updated Jul 18, 2025
Cite
U.S. EPA Office of Research and Development (ORD) (2025). Main text figure data [Dataset]. https://catalog.data.gov/dataset/main-text-figure-data
Explore at:
Dataset updated
Jul 18, 2025
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
Raw underlying data for visualizations in the main body of the manuscript. This dataset is associated with the following publication: Champion, W., M. MacDonald, B. Thomas, S. Bantupalli, and E. Thoma. Methane sensor characterization using colocated ambient comparisons and simulated emission challenges. ACS ES&T Air. American Chemical Society, Washington, DC, USA, 0, (2025).

Top 100 Billboard

kaggle.com

zip

Updated Sep 25, 2023

Cite

Sujay Kapadnis (2023). Top 100 Billboard [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/top-100-billboard

Explore at:

Available download formats: zip (28541119 bytes)

Dataset updated

Sep 25, 2023

Authors

Sujay Kapadnis

Description

The data this week comes from Data.World by way of Sean Miller, Billboard.com and Spotify.

Billboard Top 100 - Wikipedia

The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.

Billboard Top 100 Article

Drake rewrites the record for the most entries ever on the Billboard Hot 100, as he lands his 208th career title on the latest list, dated March 21

Data Dictionary

`billboard.csv`

variable	class	description
url	character	Billboard Chart URL
week_id	character	Week ID
week_position	double	Week position (1 to 100)
song	character	Song name
performer	character	Performer name
song_id	character	Song ID, combo of song/singer
instance	double	Instance (this is used to separate breaks on the chart for a given song. Example, an instance of 6 tells you that this is the sixth time this song has appeared on the chart)
previous_week_position	double	Previous week position
peak_position	double	Peak position as of that week
weeks_on_chart	double	Weeks on chart as of that week
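
To illustrate how the `billboard.csv` columns fit together, the sketch below reads a few rows and finds each song's best chart position. The sample rows, the `week_id` date format, and the function name `best_positions` are hypothetical; only the column names come from the data dictionary above.

```python
import csv
import io

# Hypothetical rows using a subset of the billboard.csv columns described above.
SAMPLE = """week_id,week_position,song,performer,song_id,instance
1/4/2020,5,Song A,Artist X,Song AArtist X,1
1/11/2020,3,Song A,Artist X,Song AArtist X,1
1/11/2020,40,Song B,Artist Y,Song BArtist Y,1
"""


def best_positions(csv_text: str) -> dict:
    """Map each song_id to its best (numerically lowest) week_position."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pos = int(float(row["week_position"]))  # column is documented as double
        sid = row["song_id"]
        best[sid] = min(pos, best.get(sid, pos))
    return best


peaks = best_positions(SAMPLE)
print(peaks)
```

For the real file, the same function could be fed the file's text instead of `SAMPLE`; the result should agree with the last `peak_position` recorded for each song.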

`audio_features.csv`

variable	class	description
song_id	character	Song ID
performer	character	Performer name
song	character	Song
spotify_genre	character	Genre
spotify_track_id	character	Track ID
spotify_track_preview_url	character	Spotify URL
spotify_track_duration_ms	double	Duration in ms
spotify_track_explicit	logical	Is explicit
spotify_track_album	character	Album name
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that t...
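
Several of the audio features above are encoded values that need decoding before they are readable. The sketch below, with hypothetical helper names, translates `key`, `mode`, and `speechiness` using the rules quoted in the table; the pitch names beyond the three listed there follow standard pitch class notation.

```python
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_name(key: float):
    """Translate the `key` column to a pitch name; -1 means no key detected."""
    k = int(key)
    return None if k == -1 else PITCH_CLASSES[k]


def mode_name(mode: float) -> str:
    """Major is represented by 1 and minor by 0."""
    return "major" if int(mode) == 1 else "minor"


def speechiness_band(value: float) -> str:
    """Bucket speechiness using the 0.33 and 0.66 thresholds quoted above."""
    if value > 0.66:
        return "probably all spoken words"
    if value >= 0.33:
        return "may mix music and speech"
    return "most likely music"


print(key_name(1), mode_name(0), speechiness_band(0.5))
```

The columns are documented as doubles, so the helpers convert to `int` before looking anything up.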

Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...
zenodo.org
data.niaid.nih.gov
zip
Updated May 23, 2023
Cite
Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7916716
Dataset updated
May 23, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the repository for the ISWC 2023 Resource Track submission of Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark for evaluating the ability of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

An example

An example test sentence:

Test Sentence: {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

An example of ontology:

Ontology: Music Ontology

Expected Output:

{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" },
    { "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" },
    { "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" }
  ]
}
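
The ontology-compliance part of the task can be checked mechanically. The sketch below parses the example record, re-typed as strict JSON, and flags any triple whose relation falls outside an allowed set. The set `ALLOWED_RELATIONS` is a stand-in built from the example itself, since the real ontology file is not shown here, and `off_ontology` is a hypothetical helper, not part of the benchmark's evaluation scripts.

```python
import json

# Example record from the description, written as strict JSON
# (no trailing comma after the last triple).
RECORD = json.loads(r"""
{"id": "ont_k_music_test_n",
 "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
 "triples": [
   {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
   {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
   {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}
 ]}
""")

# Assumption: a stand-in for the relations allowed by the music ontology;
# the real ontology file is not shown in the description.
ALLOWED_RELATIONS = {"publication date", "lyrics by"}


def off_ontology(record: dict, allowed: set) -> list:
    """Return the triples whose relation is not permitted by the ontology."""
    return [t for t in record["triples"] if t["rel"] not in allowed]


violations = off_ontology(RECORD, ALLOWED_RELATIONS)
print(len(RECORD["triples"]), "triples,", len(violations), "off-ontology")
```

A fuller check would also test the domain and range constraints mentioned in the description, which this sketch leaves out.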

The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

The structure of the repo is as follows.

Text2KGBench
  src: the source code used for generation and evaluation, and baseline
    benchmark the code used to generate the benchmark
    evaluation evaluation scripts for calculating the results
    baseline code for generating the baselines including prompts, sentence similarities, and LLM client.
  data: the benchmark datasets and baseline data. There are two datasets: wikidata_tekgen and dbpedia_webnlg.
    wikidata_tekgen Wikidata-TekGen Dataset
      ontologies 10 ontologies used by this dataset
      train training data
      test test data
      manually_verified_sentences ids of a subset of test cases manually validated
      unseen_sentences new sentences that are added by the authors which are not part of Wikipedia
        test unseen test unseen test sentences
        ground_truth ground truth for unseen test sentences.
      ground_truth ground truth for the test data
      baselines data related to running the baselines.
        test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.
        prompts prompts corresponding to each test file
          unseen prompts unseen prompts for the unseen test cases
        Alpaca-LoRA-13B data related to the Alpaca-LoRA model
          llm_responses raw LLM responses and extracted triples
          eval_metrics ontology-level and aggregated evaluation results
          unseen results results for the unseen test cases
            llm_responses raw LLM responses and extracted triples
            eval_metrics ontology-level and aggregated evaluation results
        Vicuna-13B data related to the Vicuna-13B model
          llm_responses raw LLM responses and extracted triples
          eval_metrics ontology-level and aggregated evaluation results
    dbpedia_webnlg DBpedia Dataset
      ontologies 19 ontologies used by this dataset
      train training data
      test test data
      ground_truth ground truth for the test data
      baselines data related to running the baselines.
        test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.
        prompts prompts corresponding to each test file
        Alpaca-LoRA-13B data related to the Alpaca-LoRA model
          llm_responses raw LLM responses and extracted triples
          eval_metrics ontology-level and aggregated evaluation results
        Vicuna-13B data related to the Vicuna-13B model
          llm_responses raw LLM responses and extracted triples
          eval_metrics ontology-level and aggregated evaluation results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Open Text | OTC - Market Capitalization
tradingeconomics.com
csv, excel, json, xml
Updated Feb 22, 2018
Cite
TRADING ECONOMICS (2018). Open Text | OTC - Market Capitalization [Dataset]. https://tradingeconomics.com/otc:cn:market-capitalization
Explore at:
Available download formats: csv, xml, excel, json
Dataset updated
Feb 22, 2018
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2000 - Dec 2, 2025
Area covered
Canada
Description
Open Text reported $12.76B in Market Capitalization in December 2025, based on the latest stock price and the number of outstanding shares. Data for Open Text | OTC - Market Capitalization, including historical data, tables, and charts, was last updated by Trading Economics in December 2025.
Code might be found under:...
plos.figshare.com
zip
Updated Dec 23, 2024
Cite
Agata Skorupka (2024). Code might be found under: https://kaggle.com/code/agatasko/anomalies-graph-networks. [Dataset]. http://doi.org/10.1371/journal.pone.0315849.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0315849.s001
Dataset updated
Dec 23, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Agata Skorupka
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Technical appendix can be found under: https://www.kaggle.com/datasets/agatasko/tech-appendix.

List of supplements (ZIP):

plots:
a. 01_TwiBot_20_histograms.html
b. 02_Bitcoin_OTC_histograms.html
c. 03_Bitcoin_Alpha_histograms.html
d. 04_TwiBot_20_dimensionality.html
e. 05_Bitcoin_OTC_dimensionality.html
f. 06_Bitcoin_Alpha_dimensionality.html

tables:
a. 01_TwiBot_20_statistics.csv
b. 02_Bitcoin_OTC_statistics.csv
c. 03_Bitcoin_Alpha_statistics.csv
d. 04_TwiBot_20_results.csv
e. 05_Bitcoin_OTC_results.csv
f. 06_Bitcoin_Alpha_results.csv
g. 07_TwiBot_20_compression_results.csv
h. 08_Bitcoin_OTC_compression_results.csv
i. 09_Bitcoin_Alpha_compression_results.csv
Dictionary Graph
kaggle.com
zip
Updated Sep 7, 2021
Cite
bfbarry (2021). Dictionary Graph [Dataset]. https://www.kaggle.com/bfbarry/dictionary-graph
Explore at:
Available download formats: zip (3759523 bytes)
Dataset updated
Sep 7, 2021
Authors
bfbarry
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dictionary Network Graph

Contains a graph representation of the English dictionary, where each word is a node and an edge links a word to each word that appears in its definition. The JSON file is of the form: {word: [Each, word, in, its, definition] ... }

Use this dataset to explore the structure of natural language!
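
As one small example of such exploration, the sketch below counts how many definitions each word appears in, which is the in-degree of that word's node. The three-entry `SAMPLE` is made up to match the described `{word: [words in its definition]}` form, and `in_degree` is a hypothetical helper name.

```python
import json
from collections import Counter

# Tiny hypothetical sample in the {word: [words in its definition]} form.
SAMPLE = json.loads("""
{"cat": ["small", "domesticated", "animal"],
 "dog": ["domesticated", "animal"],
 "animal": ["living", "organism"]}
""")


def in_degree(graph: dict) -> Counter:
    """Count how many definitions each word appears in (incoming edges)."""
    counts = Counter()
    for word, definition in graph.items():
        # Use a set so a word repeated within one definition counts once.
        counts.update(set(definition))
    return counts


degrees = in_degree(SAMPLE)
print(degrees.most_common(2))
```

Words with a high in-degree are the ones many other definitions rely on, which makes them a natural starting point for studying the graph's structure.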
EventNarrative
kaggle.com
zip
Updated Jun 7, 2021
Cite
acolas1 (2021). EventNarrative [Dataset]. https://www.kaggle.com/acolas1/eventnarration
Explore at:
Available download formats: zip (39735780 bytes)
Dataset updated
Jun 7, 2021
Authors
acolas1
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
EventNarrative: A large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation. Accepted at the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. Authors: Anthony Colas, Ali Sadeghian, Yue Wang, Daisy Wang (University of Florida).

A knowledge graph-to-text dataset built from publicly available open-world knowledge graphs. EventNarrative consists of approximately 230,000 graphs and their corresponding natural language text, 6 times larger than the current largest parallel dataset. It makes use of a rich ontology, all of the KG entities are linked to the text, and our manual annotations confirm high data quality.

If you find our dataset useful, please cite:

@inproceedings{colas2021eventnarrative,
  title={EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation},
  author={Colas, Anthony and Sadeghian, Ali and Wang, Yue and Wang, Daisy Zhe},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}
NLP feature set variables for TwiBot-20.
plos.figshare.com
xls
Updated Dec 23, 2024
Cite
Agata Skorupka (2024). NLP feature set variables for TwiBot-20. [Dataset]. http://doi.org/10.1371/journal.pone.0315849.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315849.t001
Dataset updated
Dec 23, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Agata Skorupka
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The study examines different graph-based methods of detecting anomalous activities on digital markets, proposing the most efficient way to increase market actors’ protection and reduce information asymmetry. Anomalies are defined as both bots and fraudulent users (who can be either bots or real people). Methods are compared against each other and against state-of-the-art results from the literature, and a new algorithm is proposed. The goal is to find an efficient method suitable for threat detection, in terms of both predictive performance and computational efficiency. It should scale well and remain robust as new technologies advance. The article uses three publicly accessible graph-based datasets: one describing the Twitter social network (TwiBot-20) and two describing Bitcoin cryptocurrency markets (Bitcoin OTC and Bitcoin Alpha). In the former, an anomaly is defined as a bot, as opposed to a human user, whereas in the latter, an anomaly is a user who conducted a fraudulent transaction, which may (but does not have to) imply being a bot. The study shows that graph-based data is a better-performing predictor than text data. It compares different graph algorithms for extracting feature sets for anomaly detection models, and states that methods based on nodes’ statistics result in better model performance than state-of-the-art graph embeddings. They also yield a significant improvement in computational efficiency, which often means reducing the time by hours or enabling modeling on significantly larger graphs (usually not feasible with embeddings). On that basis, the article proposes its own graph-based statistics algorithm.

Furthermore, using embeddings requires two engineering choices: the type of embedding and its dimension. The research examines whether some types of graph embeddings and dimensions perform significantly better than others. The answer turned out to be dataset-specific and had to be tailored case by case, adding even more engineering overhead to using embeddings (building a leaderboard over a grid of embedding instances, each of which takes hours to generate). This again speaks in favor of the proposed algorithm based on nodes’ statistics, which makes this engineering overhead unnecessary.
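
To give a sense of what features based on nodes' statistics can look like, the sketch below computes in-degree, out-degree, and total degree for every node of a small directed edge list. This is only an illustration: the abstract does not say which statistics the article's algorithm uses, so the choice of degree counts, the toy `EDGES` list, and the function name `node_statistics` are all assumptions.

```python
from collections import defaultdict

# Hypothetical directed edge list (source, target), e.g. who rated whom.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]


def node_statistics(edges):
    """Compute simple per-node features: in-degree, out-degree, total degree."""
    stats = defaultdict(lambda: {"in": 0, "out": 0})
    for src, dst in edges:
        stats[src]["out"] += 1
        stats[dst]["in"] += 1
    return {
        node: {**s, "total": s["in"] + s["out"]}
        for node, s in stats.items()
    }


features = node_statistics(EDGES)
print(features["c"])
```

Statistics of this kind need a single pass over the edges and involve no choice of embedding type or dimension, which is consistent with the efficiency argument made above.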
Building a graph database for digital humanities scientists
resodate.org
Updated Jan 1, 2023
Cite
Triet Doan (2023). Building a graph database for digital humanities scientists [Dataset]. http://doi.org/10.25625/O9IRPY
Explore at:
Unique identifier
https://doi.org/10.25625/O9IRPY
Dataset updated
Jan 1, 2023
Dataset provided by
Georg-August-Universität Göttingen
GRO.data
Authors
Triet Doan
Description
Graph databases have developed rapidly and now play an important role in research. They help scientists in various ways, e.g., finding related works, exploring works in a research area, or gaining knowledge from connections between different nodes. Some graph databases for research are already available on the Internet. However, they do not meet the needs of Digital Humanities (DH) scientists, who mainly work with historical data. Therefore, we created a graph database specifically for DH scientists. This database is part of MINE, a service that facilitates data acquisition and big data analysis.
Data from: Microsoft Concept Graph: Mining Semantic Concepts for Short Text...
scidb.cn
Updated Oct 16, 2020
Cite
Lei Ji; Yujing Wang; Botian Shi; Dawei Zhang; Zhongyuan Wang; Jun Yan (2020). Microsoft Concept Graph: Mining Semantic Concepts for Short Text Understanding [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00047
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.11922/sciencedb.j00104.00047
Dataset updated
Oct 16, 2020
Dataset provided by
Science Data Bank
Authors
Lei Ji; Yujing Wang; Botian Shi; Dawei Zhang; Zhongyuan Wang; Jun Yan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The four tables and 23 figures of this paper. Table 1 shows the concept space comparison of existing taxonomies. Table 2 presents Hearst pattern examples. Table 3 shows the labeling guideline for conceptualization. Table 4 presents the precision of short text understanding. Figure 1 shows the framework overview. Figure 2 shows local taxonomy construction. Figure 3 shows horizontal merging. Figure 4 shows vertical merging: single sense alignment. Figure 5 shows vertical merging: multiple sense alignment. Figure 6 is a subgraph of the heterogeneous semantic network around watch. Figure 7 shows the compression procedure of the typed-term co-occurrence network. Figure 8 presents an example of short text understanding. Figure 9 presents examples of the Chain model and the Pairwise model. Figure 10 is a snapshot of the Probase browser. Figure 11 is a snapshot of single instance conceptualization. Figure 12 is a snapshot of context-aware single instance conceptualization. Figure 13 shows an example of short text conceptualization. Figure 14 shows the framework of topic search. Figure 15 is a snapshot of the Web tables. Figure 16 shows a query recommendation snapshot. Figure 17 shows the correlation of CTR with ads relevance score. Figure 18 presents the distribution of concepts in Microsoft Concept Graph. Figure 19 shows the concept coverage of different taxonomies. Figure 20 shows the precision of extracted isA pairs on 40 concepts. Figure 21 shows the precision of isA pairs after each iteration. Figure 22 shows the number of discovered concepts and isA pairs after each iteration. Figure 23 shows the precision and nDCG comparison.
Statistics of TABLE and TEXT.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Sep 11, 2024
Kawazoe, Yoshimasa; Hori, Satoko; Aramaki, Eiji; Nishiyama, Tomohiro; Yada, Shuntaro; Imai, Shungo; Wakamiya, Shoko (2024). Statistics of TABLE and TEXT. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001351015
Dataset updated
Sep 11, 2024
Authors
Kawazoe, Yoshimasa; Hori, Satoko; Aramaki, Eiji; Nishiyama, Tomohiro; Yada, Shuntaro; Imai, Shungo; Wakamiya, Shoko
Description
Real-world data (RWD) in the medical field, such as electronic health records (EHRs) and medication orders, are receiving increasing attention from researchers and practitioners. While structured data have played a vital role thus far, unstructured data represented by text (e.g., discharge summaries) are not effectively utilized because of the difficulty in extracting medical information. We evaluated the information gained by supplementing structured data with clinical concepts extracted from unstructured text by leveraging natural language processing techniques. Using a machine learning-based pretrained named entity recognition tool, we extracted disease and medication names from real discharge summaries in a Japanese hospital and linked them to medical concepts using medical term dictionaries. By comparing the diseases and medications mentioned in the text with medical codes in tabular diagnosis records, we found that: (1) the text data contained richer information on patient symptoms than tabular diagnosis records, whereas the medication-order table stored more injection data than text. In addition, (2) extractable information regarding specific diseases showed surprisingly small intersections among text, diagnosis records, and medication orders. Text data can thus be a useful supplement for RWD mining, which is further demonstrated by (3) our practical application system for drug safety evaluation, which exhaustively visualizes suspicious adverse drug effects caused by the simultaneous use of anticancer drug pairs. We conclude that proper use of textual information extraction can lead to better outcomes in medical RWD mining.
Chart 3.9.1 Community Supports Utilization Rates by MCP and County in the...
data.chhs.ca.gov
data.ca.gov
+2 more
Updated Oct 9, 2025
Department of Health Care Services (2025). Chart 3.9.1 Community Supports Utilization Rates by MCP and County in the Last 12 Months of the Reporting Period [Dataset]. https://data.chhs.ca.gov/dataset/chart-3-9-1-community-supports-utilization-rates-by-mcp-and-county-in-the-last-12-months-of-the
Available download formats: html, geojson, csv, zip, kml, arcgis geoservices rest api
Dataset updated
Oct 9, 2025
Dataset provided by
California Department of Health Care Services (http://www.dhcs.ca.gov/)
Authors
Department of Health Care Services
Description
ECM and Community Support Services tables for a Quarterly Implementation Report, including the county and plan details for both ECM and Community Support.
This Medi-Cal Enhanced Care Management (ECM) and Community Supports Calendar Year Quarterly Implementation Report provides a comprehensive overview of ECM and Community Supports implementation in the programs' first year. It includes data at the state, county, and plan levels on total members served, utilization, and provider networks.
ECM is a statewide MCP benefit that provides person-centered, community-based care management to the highest need members. The Department of Health Care Services (DHCS) and its MCP partners began implementing ECM in phases by Populations of Focus (POFs), with the first three POFs launching statewide in CY 2022.
Community Supports are services that address members’ health-related social needs and help them avoid higher, costlier levels of care. Although it is optional for MCPs to offer these services, every Medi-Cal MCP offered Community Supports in 2022, and at least two Community Supports services were offered and available in every county by the end of the year.
Open Text | OTC - Dividend Yield
tradingeconomics.com
csv, excel, json, xml
Updated Sep 15, 2025
TRADING ECONOMICS (2025). Open Text | OTC - Dividend Yield [Dataset]. https://tradingeconomics.com/otc:cn:dy
Available download formats: json, excel, xml, csv
Dataset updated
Sep 15, 2025
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2000 - Dec 2, 2025
Area covered
Canada
Description
Open Text reported $2.84 in Dividend Yield for its fiscal quarter ending in September of 2025. Data for Open Text | OTC - Dividend Yield, including historical data, tables, and charts, were last updated by Trading Economics in December 2025.


Saad Obaid ul Islam (2024). chart-to-text [Dataset]. https://huggingface.co/datasets/saadob12/chart-to-text

307 scholarly articles cite this dataset.
