100+ datasets found
  1. h

    chart-to-text

    • huggingface.co
    Updated Oct 28, 2024
    Cite
    Saad Obaid ul Islam (2024). chart-to-text [Dataset]. https://huggingface.co/datasets/saadob12/chart-to-text
    Dataset updated
    Oct 28, 2024
    Authors
    Saad Obaid ul Islam
    Description

    Tackling Hallucinations in Neural Chart Summarization

      Introduction
    

    The trained model for investigations and state-of-the-art (SOTA) improvements are detailed in the paper: Tackling Hallucinations in Neural Chart Summarization. This repo contains optimized input prompts and summaries after NLI-filtering.

      Abstract
    

    Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the problem of… See the full description on the dataset page: https://huggingface.co/datasets/saadob12/chart-to-text.

  2. R

    Chart Text Detection Dataset

    • universe.roboflow.com
    zip
    Updated Sep 26, 2024
    Cite
    minhngoncoding (2024). Chart Text Detection Dataset [Dataset]. https://universe.roboflow.com/minhngoncoding/chart-text-detection
    Available download formats: zip
    Dataset updated
    Sep 26, 2024
    Dataset authored and provided by
    minhngoncoding
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    Chart Text Detection

    ## Overview
    
    Chart Text Detection is a dataset for object detection tasks - it contains Text annotations for 6,399 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. S

    Effective comment data and chart data

    • scidb.cn
    Updated Apr 25, 2022
    Cite
    Li Shancheng (2022). Effective comment data and chart data [Dataset]. http://doi.org/10.57760/sciencedb.01715
    Dataset updated
    Apr 25, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Li Shancheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data package contains two files: one holds all valid comment text data used in the paper (297,774 items in total); the other holds the data required to draw the main graphs in the paper.

  4. h

    Text-Attributed-Graphs

    • huggingface.co
    Updated Feb 19, 2025
    Cite
    Graph Computation and Machine Learning (GCOM) Group (2025). Text-Attributed-Graphs [Dataset]. https://huggingface.co/datasets/Graph-COM/Text-Attributed-Graphs
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    Graph Computation and Machine Learning (GCOM) Group
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains the encoder embeddings and LLM prediction results from the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.

      Dataset Description
    

    The dataset structure should be organized as follows:

    /dataset/
    │── [dataset_name]/
    │   │── processed_data.pt    # Contains labels and graph information
    │   │── [encoder]_x.pt       # Features extracted by different encoders
    │…

    See the full description on the dataset page: https://huggingface.co/datasets/Graph-COM/Text-Attributed-Graphs.

  5. O

    Chart2Text (Chart Summarization Dataset)

    • opendatalab.com
    zip
    Updated Mar 9, 2023
    Cite
    York University (2023). Chart2Text (Chart Summarization Dataset) [Dataset]. https://opendatalab.com/OpenDataLab/Chart2Text
    Available download formats: zip
    Dataset updated
    Mar 9, 2023
    Dataset provided by
    York University
    Description

    Chart2Text is a dataset crawled from 23,382 freely accessible pages on statista.com in early March 2020, yielding a total of 8,305 charts and associated summaries. For each chart, the chart image, the underlying data table, the title, the axis labels, and a human-written summary describing the statistic were downloaded.

  6. S

    CBCD:A Chinese Bar Chart Dataset for Data Extraction

    • scidb.cn
    Updated Nov 14, 2025
    Cite
    Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan (2025). CBCD:A Chinese Bar Chart Dataset for Data Extraction [Dataset]. http://doi.org/10.57760/sciencedb.j00240.00052
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Most existing chart datasets are in English, and there are almost no open-source Chinese chart datasets, which limits research and applications involving Chinese charts. This dataset draws on the construction method of the DVQA dataset to create a chart dataset focused on the Chinese environment. To ensure the authenticity and practicality of the dataset, we first referred to the website of the National Bureau of Statistics and selected 24 widely used data label categories, totaling 262 specific labels. These label categories cover areas such as socio-economics, demographics, and industrial development. To further increase diversity, the dataset uses 10 different numerical dimensions, which provide a wide range of values and value types and can simulate the data distributions and changes encountered in real application scenarios.

    The dataset includes conventional vertical and horizontal bar charts as well as more challenging stacked bar charts, to test methods on charts of different complexity. Each chart type also has diverse attribute labels, including but not limited to whether data labels are present and whether the text is rotated by 45°, 90°, etc. These details bring the dataset closer to real-world application scenarios and place higher demands on data extraction methods. In addition to the charts themselves, the dataset provides the corresponding data table and title text for each chart, which is crucial for understanding the chart content and verifying the accuracy of the extracted results.

    Chart images were generated with Matplotlib, a widely used data visualization library for Python. Matplotlib allows every detail of a chart to be controlled, from the drawing of data points to axis annotations, legends, and titles, so that the generated images meet the research needs and remain readable. The dataset consists of 58,712 pairs of Chinese bar charts and corresponding data tables, divided into training, validation, and testing sets in a 7:2:1 ratio.
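The 7:2:1 ratio does not divide 58,712 evenly, so the exact subset sizes depend on how the authors rounded. The sketch below shows one common way to do it (largest-remainder rounding); the function name `split_counts` is hypothetical and the resulting counts are an illustration, not the published split sizes.

```python
def split_counts(total: int, ratios: tuple[int, ...]) -> list[int]:
    """Split `total` items by integer ratios using largest-remainder rounding."""
    weight = sum(ratios)
    # Floor share for each part, plus the remainder that was cut off.
    counts = [total * r // weight for r in ratios]
    remainders = [total * r % weight for r in ratios]
    leftover = total - sum(counts)
    # Hand the leftover items to the parts with the largest remainders.
    order = sorted(range(len(ratios)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


train, val, test = split_counts(58712, (7, 2, 1))
print(train, val, test)  # 41099 11742 5871
```

Whatever rounding rule is used, the three counts should sum back to 58,712.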

  7. Graphs in Text

    • kaggle.com
    zip
    Updated Aug 9, 2020
    Cite
    Ghanender Pahuja (2020). Graphs in Text [Dataset]. https://www.kaggle.com/datasets/ghanender/graphs-in-text
    Available download formats: zip (5,248,238 bytes)
    Dataset updated
    Aug 9, 2020
    Authors
    Ghanender Pahuja
    Description

    Dataset

    This dataset was created by Ghanender Pahuja

    Contents

  8. Z

    Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7916715
    Dataset updated
    May 23, 2023
    Dataset provided by
    Sharda University, India
    IBM Research Europe
    Universidad Autonoma de Tamaulipas, Mexico
    ACM SIGMOD Professional Member
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence: {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
      "id": "ont_k_music_test_n",
      "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
      "triples": [
        { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" },
        { "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" },
        { "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" }
      ]
    }
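To make the two constraints concrete (compliance with the ontology and faithfulness to the sentence), here is a minimal sketch of a sanity check over one output record. The relation set `ALLOWED_RELATIONS` and the function `check_triples` are hypothetical stand-ins: the real Music ontology defines its own relations, and the benchmark's evaluation scripts are more thorough than this substring test.

```python
import json

# Hypothetical relation list for illustration only; the actual ontology file
# shipped with the benchmark is the authoritative source.
ALLOWED_RELATIONS = {"publication date", "lyrics by", "performer"}

record_json = r'''
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}
  ]
}
'''


def check_triples(record, allowed_relations):
    """Return (triple, problems) pairs for triples that fail a basic check."""
    issues = []
    for triple in record["triples"]:
        problems = []
        # Ontology compliance: the relation must be one the ontology defines.
        if triple["rel"] not in allowed_relations:
            problems.append("relation not in ontology")
        # Crude faithfulness check: the subject should appear in the sentence.
        if triple["sub"] not in record["sent"]:
            problems.append("subject not found in sentence")
        if problems:
            issues.append((triple, problems))
    return issues


record = json.loads(record_json)
issues = check_triples(record, ALLOWED_RELATIONS)
print(issues)  # [] when every triple passes both checks
```

Object values such as "01 January 1962" are normalized forms that do not appear verbatim in the sentence, which is why this sketch checks only the subject.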

    The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

    The structure of the repo is as follows.

    Text2KGBench

      src: the source code used for generation, evaluation, and baselines
        benchmark: the code used to generate the benchmark
        evaluation: evaluation scripts for calculating the results
        baseline: code for generating the baselines, including prompts, sentence similarities, and LLM client

      data: the benchmark datasets and baseline data. There are two datasets: wikidata_tekgen and dbpedia_webnlg.

        wikidata_tekgen: Wikidata-TekGen dataset
          ontologies: 10 ontologies used by this dataset
          train: training data
          test: test data
          manually_verified_sentences: IDs of a subset of test cases manually validated
          unseen_sentences: new sentences added by the authors that are not part of Wikipedia
            test: unseen test sentences
            ground_truth: ground truth for unseen test sentences
          ground_truth: ground truth for the test data
          baselines: data related to running the baselines
            test_train_sent_similarity: for each test case, the 5 most similar train sentences, generated using the SBERT T5-XXL model
            prompts: prompts corresponding to each test file
              unseen prompts: unseen prompts for the unseen test cases
            Alpaca-LoRA-13B: data related to the Alpaca-LoRA model
              llm_responses: raw LLM responses and extracted triples
              eval_metrics: ontology-level and aggregated evaluation results
              unseen results: results for the unseen test cases
                llm_responses: raw LLM responses and extracted triples
                eval_metrics: ontology-level and aggregated evaluation results
            Vicuna-13B: data related to the Vicuna-13B model
              llm_responses: raw LLM responses and extracted triples
              eval_metrics: ontology-level and aggregated evaluation results

        dbpedia_webnlg: DBpedia dataset
          ontologies: 19 ontologies used by this dataset
          train: training data
          test: test data
          ground_truth: ground truth for the test data
          baselines: data related to running the baselines
            test_train_sent_similarity: for each test case, the 5 most similar train sentences, generated using the SBERT T5-XXL model
            prompts: prompts corresponding to each test file
            Alpaca-LoRA-13B: data related to the Alpaca-LoRA model
              llm_responses: raw LLM responses and extracted triples
              eval_metrics: ontology-level and aggregated evaluation results
            Vicuna-13B: data related to the Vicuna-13B model
              llm_responses: raw LLM responses and extracted triples
              eval_metrics: ontology-level and aggregated evaluation results

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

  9. T

    Open Text | OTC - Market Capitalization

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Feb 22, 2018
    Cite
    TRADING ECONOMICS (2018). Open Text | OTC - Market Capitalization [Dataset]. https://tradingeconomics.com/otc:cn:market-capitalization
    Available download formats: csv, xml, excel, json
    Dataset updated
    Feb 22, 2018
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 2, 2025
    Area covered
    Canada
    Description

    Open Text reported $12.76B in Market Capitalization this December of 2025, considering the latest stock price and the number of outstanding shares. Data for Open Text | OTC - Market Capitalization, including historical data, tables and charts, was last updated by Trading Economics in December 2025.

  10. E

    EconBiz Images for Text Extraction from Scholarly Figures

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    json
    Updated Apr 17, 2024
    Cite
    (2024). EconBiz Images for Text Extraction from Scholarly Figures [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7506
    Available download formats: json
    Dataset updated
    Apr 17, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "Scholarly figures are data visualizations like bar charts, pie charts, line graphs, maps, scatter plots or similar figures. Text extraction from scholarly figures is useful in many application scenarios, since text in scholarly figures often contains information that is not present in the surrounding text. This dataset is a corpus of 121 scholarly figures from the economics domain evaluating text extraction tools. We randomly extracted these figures from a corpus of 288,000 open access publications from EconBiz. The dataset resembles a wide variety of scholarly figures from bar charts to maps. We manually labeled the figures to create the gold standard.

    We adjusted the provided gold standard to have a uniform format for all datasets. Each figure is accompanied by a TSV file (tab-separated values) where each entry corresponds to a text line which has the following structure:

    X-coordinate of the center of the bounding box in pixel

    Y-coordinate of the center of the bounding box in pixel

    Width of the bounding box in pixel

    Height of the bounding box in pixel

    Rotation angle around its center in degree

    Text inside the bounding box
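The six fields above map naturally onto a small record type. The following is a minimal sketch of reading such a TSV and turning each center/size/angle description into corner points. It assumes the columns appear in the order listed, that files have no header row, and that the angle is applied as a standard rotation about the box center; the sample rows and the names `TextBox` and `parse_boxes` are made up for illustration, so check the dataset's ReadMe for the actual conventions.

```python
import csv
import io
import math
from dataclasses import dataclass


@dataclass
class TextBox:
    cx: float         # X-coordinate of the box center, in pixels
    cy: float         # Y-coordinate of the box center, in pixels
    width: float      # box width, in pixels
    height: float     # box height, in pixels
    angle_deg: float  # rotation around the center, in degrees
    text: str         # text inside the box

    def corners(self):
        """Return the four corners after rotating the box about its center."""
        a = math.radians(self.angle_deg)
        cos_a, sin_a = math.cos(a), math.sin(a)
        hw, hh = self.width / 2, self.height / 2
        points = []
        for dx, dy in ((-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)):
            points.append((self.cx + dx * cos_a - dy * sin_a,
                           self.cy + dx * sin_a + dy * cos_a))
        return points


def parse_boxes(tsv_text):
    """Parse tab-separated lines into TextBox records, skipping short rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    boxes = []
    for row in reader:
        if len(row) < 6:
            continue
        cx, cy, w, h, angle = map(float, row[:5])
        boxes.append(TextBox(cx, cy, w, h, angle, row[5]))
    return boxes


# Made-up sample rows, not taken from the dataset.
sample = "100\t50\t80\t20\t0\tGDP growth\n30\t200\t60\t16\t90\tPercent\n"
boxes = parse_boxes(sample)
print(boxes[0].corners())
```

Storing the center and angle rather than four corners keeps rotated axis labels compact; the corner form is only needed when comparing against detector output.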

    In addition we provide the ground truth in JSON format. A schema file is included in each dataset as well. The dataset is accompanied with a ReadMe file with further information about the figures and their origin.

    If you use this dataset in your own work, please cite one of the papers in the references."

  11. Statistical and text graph data of each dataset.

    • figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Hend Alrasheed (2023). Statistical and text graph data of each dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0255127.t006
    Available download formats: xls
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hend Alrasheed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of words and number of tokens denote the number of words in the dataset before and after preprocessing respectively. Direct edges and indirect edges represent the number of direct and indirect synonym relationships between words in the text graph respectively.

  12. T

    Open Text | OTC - Assets

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Sep 15, 2025
    Cite
    TRADING ECONOMICS (2025). Open Text | OTC - Assets [Dataset]. https://tradingeconomics.com/otc:cn:assets
    Available download formats: xml, excel, json, csv
    Dataset updated
    Sep 15, 2025
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 2, 2025
    Area covered
    Canada
    Description

    Open Text reported $13.48B in Assets for its fiscal quarter ending in September of 2025. Data for Open Text | OTC - Assets, including historical data, tables and charts, was last updated by Trading Economics in December 2025.

  13. Knowledge Graph Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    lariseak (2025). Knowledge Graph Dataset [Dataset]. https://www.kaggle.com/datasets/lariseak/knowledge-graph-dataset
    Available download formats: zip (2,037,713 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    lariseak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We provide a new, publicly available dataset of knowledge graphs extracted (with both REBEL and Gemini) from the Reuters-21578, BBC, AG News, and 20 Newsgroups benchmarks. This resource can be used to benchmark other graph-based and knowledge-aware classification methods.

  14. Z

    Data from: Graphine: A Dataset for Graph-aware Terminology Definition...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 6, 2021
    Cite
    Zequn Liu; Shukai Wang; Yiyang Gu; Ruiyi Zhang; Ming Zhang; Sheng Wang (2021). Graphine: A Dataset for Graph-aware Terminology Definition Generation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5320309
    Dataset updated
    Sep 6, 2021
    Dataset provided by
    Peking University
    University of Washington
    Authors
    Zequn Liu; Shukai Wang; Yiyang Gu; Ruiyi Zhang; Ming Zhang; Sheng Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset of our EMNLP 2021 paper:

    Graphine: A Dataset for Graph-aware Terminology Definition Generation.

    Please read the "readme.md" in it for the format of the dataset.

  15. Top 100 Billboard

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Cite
    Sujay Kapadnis (2023). Top 100 Billboard [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/top-100-billboard
    Available download formats: zip (28,541,119 bytes)
    Dataset updated
    Sep 25, 2023
    Authors
    Sujay Kapadnis
    Description

    The data this week comes from Data.World by way of Sean Miller, Billboard.com and Spotify.

    Billboard Top 100 - Wikipedia

    The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.

    Billboard Top 100 Article

    Drake rewrites the record for the most entries ever on the Billboard Hot 100, as he lands his 208th career title on the latest list, dated March 21

    Data Dictionary

    billboard.csv

    variable | class | description
    url | character | Billboard Chart URL
    week_id | character | Week ID
    week_position | double | Week position 1: 100
    song | character | Song name
    performer | character | Performer name
    song_id | character | Song ID, combo of song/singer
    instance | double | Instance (this is used to separate breaks on the chart for a given song. Example, an instance of 6 tells you that this is the sixth time this song has appeared on the chart)
    previous_week_position | double | Previous week position
    peak_position | double | Peak position as of that week
    weeks_on_chart | double | Weeks on chart as of that week
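The "as of that week" columns are running values per song. As a rough illustration of what `peak_position` and `weeks_on_chart` mean, the sketch below recomputes them from `song_id` and `week_position` alone. It assumes the rows are already in chart-week order and uses made-up sample rows; the function `add_running_stats` is hypothetical and is not how Billboard or the dataset author derived the columns.

```python
from collections import defaultdict


def add_running_stats(rows):
    """Add running weeks_on_chart and peak_position to rows sorted by week."""
    weeks = defaultdict(int)
    peak = {}
    out = []
    for row in rows:
        song = row["song_id"]
        position = row["week_position"]
        weeks[song] += 1
        # A lower number is a better chart position, so the peak is the minimum.
        peak[song] = min(peak.get(song, position), position)
        out.append({**row,
                    "weeks_on_chart": weeks[song],
                    "peak_position": peak[song]})
    return out


# Made-up rows for a single hypothetical song, in chart-week order.
rows = [
    {"song_id": "Example SongExample Artist", "week_position": 40},
    {"song_id": "Example SongExample Artist", "week_position": 12},
    {"song_id": "Example SongExample Artist", "week_position": 25},
]
for r in add_running_stats(rows):
    print(r["week_position"], r["weeks_on_chart"], r["peak_position"])
# 40 1 40
# 12 2 12
# 25 3 12
```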

    audio_features.csv

    variable | class | description
    song_id | character | Song ID
    performer | character | Performer name
    song | character | Song
    spotify_genre | character | Genre
    spotify_track_id | character | Track ID
    spotify_track_preview_url | character | Spotify URL
    spotify_track_duration_ms | double | Duration in ms
    spotify_track_explicit | logical | Is explicit
    spotify_track_album | character | Album name
    danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
    loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
    mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    instrumentalness | double | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that t...
  16. r

    Building a graph database for digital humanities scientists

    • resodate.org
    Updated Jan 1, 2023
    Cite
    Triet Doan (2023). Building a graph database for digital humanities scientists [Dataset]. http://doi.org/10.25625/O9IRPY
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    Georg-August-Universität Göttingen
    GRO.data
    Authors
    Triet Doan
    Description

    Graph databases have developed rapidly and play an important role in research today. They help scientists in various ways, e.g., finding related works, exploring works in a research area, or gaining knowledge from connections between different nodes. Some graph databases for research are already available on the Internet. However, they do not meet the needs of Digital Humanities (DH) scientists, who mainly work with historical data. Therefore, we created a graph database specifically for DH scientists. This database is part of MINE, a service that facilitates data acquisition and big data analysis.

  17. Statistical and text graph data of each abstract in the HULTH dataset.

    • figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Hend Alrasheed (2023). Statistical and text graph data of each abstract in the HULTH dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0255127.t008
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hend Alrasheed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of words and number of tokens denote the number of words in the dataset before and after preprocessing respectively. Direct edges and indirect edges represent the number of direct and indirect synonym relationships between words in the text graph respectively.

  18. T

    Open Text | OTC - Debt

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Jun 15, 2025
    Cite
    TRADING ECONOMICS (2025). Open Text | OTC - Debt [Dataset]. https://tradingeconomics.com/otc:cn:debt
    Available download formats: xml, json, excel, csv
    Dataset updated
    Jun 15, 2025
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 3, 2025
    Area covered
    Canada
    Description

    Open Text reported $4.57M in Debt for its fiscal quarter ending in June of 2025. Data for Open Text | OTC - Debt, including historical data, tables and charts, was last updated by Trading Economics in December 2025.

  19. i

    Graph Neural NMF Enhanced by Optimal Transport: Short Text Topic Modeling...

    • ieee-dataport.org
    Updated Oct 21, 2025
    Cite
    Bing Zhao (2025). Graph Neural NMF Enhanced by Optimal Transport: Short Text Topic Modeling with Pretrained Language Models and Nonparametric Baye [Dataset]. https://ieee-dataport.org/documents/graph-neural-nmf-enhanced-optimal-transport-short-text-topic-modeling-pretrained-language
    Dataset updated
    Oct 21, 2025
    Authors
    Bing Zhao
    Description

    and semantic bias

  20. EduMKG: A Multimodal Knowledge Graph for Education with Text, Image, Video...

    • zenodo.org
    Updated Jul 18, 2025
    Cite
    Tong Lu; Tong Lu (2025). EduMKG: A Multimodal Knowledge Graph for Education with Text, Image, Video and Audio [Dataset]. http://doi.org/10.5281/zenodo.15694552
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tong Lu; Tong Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EduMKG is a multimodal educational knowledge graph dataset that covers natural sciences (biology, physics, and chemistry) in middle and high school education. It includes multimodal concepts covering text, images, videos, and audio, as well as knowledge points and exercises extracted from curriculum standards and MOOCs. EduMKG comprises 34,630 multimodal concepts and 403,400 triples, making it a vital resource for research in multimodal educational applications.

    🎯🎯🎯 Recent Updates for EduMKG

    1. We have open-sourced the RDF data of EduMKG (following IRI standards).
    2. We have open-sourced an automation script for converting JSON to RDF format on GitHub at https://github.com/AI-BNU-TEAMKG/EduMKG.
    3. We have released a SPARQL endpoint and provided basic usage examples.
    4. We have released the schema of EduMKG on GitHub at https://github.com/AI-BNU-TEAMKG/EduMKG.
    5. We have released the validation results for alignment quality on GitHub at https://github.com/AI-BNU-TEAMKG/EduMKG.

    New Document Description

    EduMKG.rdf : The RDF-formatted knowledge graph adheres to the IRI standards and has already been uploaded to this dataset.

    SPARQL Endpoint Url and Usage Instructions

    Accessing the SPARQL Endpoint and Performing Queries: We provide an example for reference.

    • Access the Apache Jena Fuseki UI

    • Enter the username and password: user, userPassword.

    • Example usage:

      Example 1


      # Query the explanation of the concept “上臂骨骼肌” (upper-arm skeletal muscle)
      PREFIX ex: <http://v1.edumkg.org/>
      SELECT ?explanation
      WHERE {
      ?concept a ex:Concept .
      ?concept ex:hasAnExplanation ?explanation .
      FILTER(CONTAINS(STR(?concept), ENCODE_FOR_URI("上臂骨骼肌")))
      }

      Example 2

      # Select 10 "subject-predicate-object" triples from the database and display them.
      # Note: LIMIT 10 returns an arbitrary 10 results, not a random sample.
      PREFIX ex: <http://v1.edumkg.org/>
      SELECT ?subject ?predicate ?object
      WHERE {
      ?subject ?predicate ?object
      }
      LIMIT 10
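The queries above can also be prepared programmatically. The following is a minimal Python sketch, not the authors' tooling: it rebuilds the Example 1 query for an arbitrary concept label and wraps it in a POST request of the kind described by the SPARQL 1.1 Protocol. The endpoint address is not given on this page, so `ENDPOINT` is a hypothetical placeholder, and the helper names `build_explanation_query` and `build_request` are illustrative. Authentication with the username and password mentioned above is not handled here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder: the page does not state the endpoint address.
ENDPOINT = "http://localhost:3030/edumkg/sparql"


def build_explanation_query(label: str) -> str:
    """Return the Example 1 query for an arbitrary concept label."""
    # Escape backslashes and double quotes so the label stays a valid
    # SPARQL string literal.
    safe = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX ex: <http://v1.edumkg.org/>\n"
        "SELECT ?explanation\n"
        "WHERE {\n"
        "  ?concept a ex:Concept .\n"
        "  ?concept ex:hasAnExplanation ?explanation .\n"
        f'  FILTER(CONTAINS(STR(?concept), ENCODE_FOR_URI("{safe}")))\n'
        "}\n"
    )


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Prepare (but do not send) a form-encoded SPARQL query request."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,  # supplying a body makes urllib issue a POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


if __name__ == "__main__":
    print(build_explanation_query("上臂骨骼肌"))
```

Once the real endpoint address is known, the prepared request could be sent with `urllib.request.urlopen` and the JSON response parsed with the `json` module.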


    Zip Description
    RawData: This folder comprises the publicly available raw data, including images derived from course materials, subtitles extracted from MOOC videos, audio files associated with specific concepts, and supplementary files generated through multimodal concept extraction and alignment.
    EduMKG: This folder contains the entire EduMKG, including the triple files of EduMKG as well as the cross-indexing files between multimodal concepts, knowledge points, and images.
    Example: Examples for the "Case Study for Multimodal Alignment" section of the paper.
    Groundtruth: The ground truth for the "Evaluation on Concept Extraction" section of the paper.

    Document Description

    Documents in EduMKG
    {subject}Image.json: This is a cross-indexing file between images and concepts in the multimodal discipline. Each image has a unique ID and is associated with a varying number of concepts.
    {subject}Knowledge.json: This is a cross-reference file for knowledge points. Each knowledge point has a unique ID, along with URLs for corresponding exercises and IDs of related knowledge points.
    {subject}.json: This is a multimodal concept cross-indexing file involving text, images, videos, audio, and knowledge points. Each multimodal concept has a unique ID.
    {subjectTriples}.json: This file contains all the triples of EduMKG.


    Documents in Example

    Each folder contains text, image, video, and audio knowledge of multimodal concepts. To avoid copyright issues, we only provide video URLs and the time intervals where the concepts appear.


    Documents in Groundtruth

    Each `groundtruth.txt` contains the concepts corresponding to the course with the same name as the folder. The contents were manually annotated by three PhD students and seven master's students, with cross-validation performed to ensure accuracy. The `groundtruth.txt` files are used in the "Evaluation on Concept Extraction" section of the paper to validate the effectiveness of the proposed concept extraction method.
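As a rough illustration of how a `groundtruth.txt` file might be used, the Python sketch below compares a set of extracted concepts with a ground-truth set and reports precision, recall, and F1. This rests on assumptions: the page does not describe the file layout or the paper's metrics, so one concept per line is assumed, and exact string matching and the helper names `load_concepts` and `precision_recall_f1` are illustrative only.

```python
def load_concepts(text: str) -> set[str]:
    """Parse file contents into a set of concepts.

    Assumption: one concept per line; blank lines are ignored.
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Score extracted concepts against the ground truth by exact match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the dataset.
    gold = load_concepts("cell\nenzyme\n\nosmosis\n")
    predicted = {"cell", "osmosis", "atom", "ion"}
    print(precision_recall_f1(predicted, gold))
```

In practice the text would come from reading a course folder's `groundtruth.txt`, and the predicted set from whichever extraction method is being evaluated.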


    Documents in RawData
    Includes images corresponding to multimodal concepts, as well as intermediate files generated during the construction of the knowledge graph, facilitating the reproducibility of the EduMKG construction process.




Cite
Saad Obaid ul Islam (2024). chart-to-text [Dataset]. https://huggingface.co/datasets/saadob12/chart-to-text

chart-to-text

saadob12/chart-to-text

Explore at:
307 scholarly articles cite this dataset (View in Google Scholar)
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
