2 datasets found
  1. Data from: Data cleaning and enrichment through data integration: networking the Italian academia

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 26, 2025
    Cite
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. https://search.dataone.org/view/sha256%3Ab583b4db2874926c7b8d8bad19da36c9a4021fea18d77573f228fad5e332f0ff
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:

    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset, and authors with no match (e.g., because they are not part of an Italian university) have been discarded. The remaining authors compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

    Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format.

    along with several supplementary files, listed below, needed only to rebuild the network (i.e., for reproducibility):

    • University-City-match.xlsx, an Excel file that maps each university name to the city where its headquarters are located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
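
The two main files listed above can likely be read with standard-library tooling alone. The sketch below is a minimal illustration, not the authors' loading code: it parses GraphML with `xml.etree.ElementTree` and an edge list with `csv.DictReader`. The inline sample, the `gender` attribute key, the node ids, and the helper names `read_graphml` and `read_edge_list` are all assumptions made for illustration; the real attribute names should be taken from the files themselves.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard GraphML namespace, as defined by the GraphML specification.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Tiny stand-in for Coauthorship_Network_AGG.graphml. The "gender" attribute
# and the node ids are hypothetical placeholders, not the real schema.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="gender" attr.type="string"/>
  <graph edgedefault="undirected">
    <node id="a1"><data key="d0">F</data></node>
    <node id="a2"><data key="d0">M</data></node>
    <node id="a3"/>
    <edge source="a1" target="a2"/>
    <edge source="a2" target="a3"/>
  </graph>
</graphml>
"""


def read_graphml(source):
    """Return (nodes, edges) from a GraphML file path or file-like object."""
    root = ET.parse(source).getroot()
    # Map key ids (e.g. "d0") to their declared attribute names.
    key_names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", GRAPHML_NS)
    }
    nodes = {}
    for node in root.iterfind("g:graph/g:node", GRAPHML_NS):
        nodes[node.get("id")] = {
            key_names.get(data.get("key"), data.get("key")): data.text
            for data in node.findall("g:data", GRAPHML_NS)
        }
    edges = [
        (edge.get("source"), edge.get("target"))
        for edge in root.iterfind("g:graph/g:edge", GRAPHML_NS)
    ]
    return nodes, edges


def read_edge_list(source):
    """Read a comma-separated edge list with a header row into dict rows."""
    return list(csv.DictReader(source))


nodes, edges = read_graphml(io.StringIO(SAMPLE_GRAPHML))
print(len(nodes), "nodes,", len(edges), "edges")
```

For the real files, `read_graphml("Coauthorship_Network_AGG.graphml")` and `read_edge_list(open("edge_data_AGG.csv", newline=""))` would be the corresponding calls, assuming the CSV has a header row.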

    Description of the main data files

    The Coauthorship_Network_AGG.graphml is intended to be the core file which c...

  2. EB_HQ: Knowledge Graph of the First Eight Editions of the Encyclopaedia Britannica (1768-1860) Following the Heritage Textual Ontology

    • zenodo.org
    bin
    Updated Oct 11, 2024
    + more versions
    Cite
    Lilin Yu; Rosa Filgueira (2024). EB_HQ: Knowledge Graph of the First Eight Editions of the Encyclopaedia Britannica (1768-1860) Following the Heritage Textual Ontology [Dataset]. http://doi.org/10.5281/zenodo.13919115
    Available download formats: bin
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lilin Yu; Rosa Filgueira
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The EB_HQ Knowledge Graph represents information from the first eight editions of the Encyclopaedia Britannica (1768-1860), structured according to the Heritage Textual Ontology (HTO). This version enhances the previously developed EB-KG by unifying data across sources (such as the National Library of Scotland and the Nineteenth-Century Knowledge Project) and editions, allowing for more comprehensive tracking of the evolution of concepts over time. For each edition, the highest-quality text source is selected. The graph integrates advanced information extraction methods and deep-learning-based knowledge enrichment, facilitating richer analyses.

    The EB_HQ captures 3316459 RDF triples, providing structured metadata and descriptions for each edition, volume, and term. It unifies information across various editions, allowing seamless tracking of the evolution of concepts over time. Data enrichment includes semantic linkages to external knowledge bases like DBpedia and Wikidata, facilitating broader analysis and connectivity to contemporary information.

    This dataset supports historical research, offering rich semantic data for researchers exploring the evolution of knowledge and concepts in the Encyclopaedia Britannica. It features terms categorized as Articles or Topics, each with detailed metadata extracted from METS and ALTO XML files.
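
As a rough illustration of how the external linkages mentioned in the description might be inspected once the triples are loaded, the sketch below tallies object URIs by host. It is a hedged example only: the subject URIs, the use of `owl:sameAs` as the linking predicate, the placeholder identifiers, and the helper name `count_external_links` are assumptions and are not taken from the EB_HQ files or the HTO.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical triples as (subject, predicate, object) tuples. The subject
# URIs, the linking predicate, and the identifiers are placeholders.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
triples = [
    ("http://example.org/term/1", SAME_AS, "http://dbpedia.org/resource/Abacus"),
    ("http://example.org/term/1", SAME_AS, "http://www.wikidata.org/entity/Q1"),
    ("http://example.org/term/2", SAME_AS, "http://dbpedia.org/resource/Algebra"),
    ("http://example.org/term/2", "http://example.org/title", "ALGEBRA"),
]

# Hosts used by DBpedia resource URIs and Wikidata entity URIs.
EXTERNAL_HOSTS = {"dbpedia.org": "DBpedia", "www.wikidata.org": "Wikidata"}


def count_external_links(triples):
    """Count triples whose object URI points at a known external knowledge base."""
    counts = Counter()
    for _subject, _predicate, obj in triples:
        host = urlparse(obj).netloc  # empty string for plain literals
        if host in EXTERNAL_HOSTS:
            counts[EXTERNAL_HOSTS[host]] += 1
    return counts


link_counts = count_external_links(triples)
print(dict(link_counts))
```

Matching on the URI host rather than on a specific predicate keeps the sketch independent of whichever linking property the dataset actually uses.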
