5 datasets found
  1. Data from: Partition MCMC for Inference on Acyclic Digraphs

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Jack Kuipers; Giusi Moffa (2023). Partition MCMC for Inference on Acyclic Digraphs [Dataset]. http://doi.org/10.6084/m9.figshare.2069687.v2
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jack Kuipers; Giusi Moffa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insight into the structural properties of a domain. Structure learning is one of the inference challenges of statistical graphical models. Markov chain Monte Carlo (MCMC) methods, notably structure MCMC, which sample graphs from the posterior distribution given the data, are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain’s convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and nonergodic edge reversal moves. Here, we propose a novel algorithm that employs the underlying combinatorial structure of DAGs to define a new grouping. As a result, convergence is improved compared to structure MCMC, while the property of producing an unbiased sample is retained. Finally, the method can be combined with edge reversal moves to improve the sampler further. Supplementary materials for this article are available online.
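    The baseline the abstract refers to, structure MCMC, can be sketched as a Metropolis-Hastings walk over DAG adjacency matrices with single-edge proposals and an acyclicity check. The sketch below is a minimal illustration of that baseline under an arbitrary user-supplied log-score; it is not the partition MCMC algorithm of the paper, and the function names are illustrative only.

    ```python
    # Minimal single-edge structure MCMC over DAGs (the baseline the paper
    # improves on, NOT its partition MCMC). `log_score` is any DAG score.
    import math
    import random

    def is_acyclic(adj):
        """Kahn's algorithm: True iff the directed graph has no cycle."""
        n = len(adj)
        indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
        queue = [j for j in range(n) if indeg[j] == 0]
        seen = 0
        while queue:
            u = queue.pop()
            seen += 1
            for v in range(n):
                if adj[u][v]:
                    indeg[v] -= 1
                    if indeg[v] == 0:
                        queue.append(v)
        return seen == n

    def structure_mcmc_step(adj, log_score, rng=random):
        """Propose toggling one edge; accept with Metropolis probability."""
        n = len(adj)
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            return adj
        proposal = [row[:] for row in adj]
        proposal[i][j] = 1 - proposal[i][j]
        if not is_acyclic(proposal):
            return adj  # reject proposals that create a cycle
        log_alpha = log_score(proposal) - log_score(adj)
        if math.log(rng.random() + 1e-300) < log_alpha:
            return proposal
        return adj
    ```

    Because cyclic proposals are rejected outright, every state of the chain is a valid DAG; the paper's contribution is to group such states into larger combinatorial collections so the chain mixes faster.
    
    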

  2. Data from: Novel Aggregate Deletion/Substitution/Addition Learning...

    • tandf.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Adam B. Olshen; Robert L. Strawderman; Gregory Ryslik; Karen Lostritto; Alice M. Arnold; Annette M. Molinaro (2023). Novel Aggregate Deletion/Substitution/Addition Learning Algorithms for Recursive Partitioning [Dataset]. http://doi.org/10.6084/m9.figshare.4892000.v2
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Adam B. Olshen; Robert L. Strawderman; Gregory Ryslik; Karen Lostritto; Alice M. Arnold; Annette M. Molinaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many complex diseases are caused by a variety of both genetic and environmental factors acting in conjunction. To help understand these relationships, nonparametric methods that use aggregate learning have been developed such as random forests and conditional forests. Molinaro et al. (2010) described a powerful, single model approach called partDSA that has the advantage of producing interpretable models. We propose two extensions to the partDSA algorithm called bagged partDSA and boosted partDSA. These algorithms achieve higher prediction accuracies than individual partDSA objects through aggregating over a set of partDSA objects. Further, by using partDSA objects in the ensemble, each base learner creates decision rules using both “and” and “or” statements, which allows for natural logical constructs. We also provide four variable ranking techniques that aid in identifying the most important individual factors in the models. In the regression context, we compared bagged partDSA and boosted partDSA to random forests and conditional forests. Using simulated and real data, we found that bagged partDSA had lower prediction error than the other methods if the data were generated by a simple logic model, and that it performed similarly for other generating mechanisms. We also found that boosted partDSA was effective for a particularly complex case. Taken together these results suggest that the new methods are useful additions to the ensemble learning toolbox. We implement these algorithms as part of the partDSA R package. Supplementary materials for this article are available online.
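    The aggregation idea behind "bagged partDSA" is ordinary bagging: fit the base learner on bootstrap resamples and average its predictions. The sketch below illustrates that generic scheme with a one-split regression stump standing in for a partDSA object (which is not implemented here); all function names are illustrative, not part of the partDSA R package.

    ```python
    # Generic bagging sketch: average base learners fit on bootstrap
    # resamples. A regression stump stands in for a partDSA object.
    import random

    def bootstrap_sample(xs, ys, rng):
        """Resample (x, y) pairs with replacement."""
        idx = [rng.randrange(len(xs)) for _ in xs]
        return [xs[i] for i in idx], [ys[i] for i in idx]

    def fit_stump(xs, ys):
        """Base learner: best single threshold split on a 1-D feature."""
        best = None
        for t in sorted(set(xs)):
            left = [y for x, y in zip(xs, ys) if x <= t]
            right = [y for x, y in zip(xs, ys) if x > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - (ml if x <= t else mr)) ** 2
                      for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, t, ml, mr)
        if best is None:  # degenerate resample: fall back to the mean
            m = sum(ys) / len(ys)
            return lambda x: m
        _, t, ml, mr = best
        return lambda x: ml if x <= t else mr

    def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
        """Average the stump predictions over bootstrap resamples."""
        rng = random.Random(seed)
        preds = []
        for _ in range(n_models):
            bx, by = bootstrap_sample(xs, ys, rng)
            preds.append(fit_stump(bx, by)(x_new))
        return sum(preds) / len(preds)
    ```

    Boosted partDSA follows the same aggregation pattern but fits each new base learner to the residuals of the current ensemble rather than to an independent resample.
    
    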

  3. Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jenna Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology. Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

    1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with the following columns:
    - 1st column: instance id (numeric): unique id assigned to a name instance
    - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
    - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
    - 4th column: author name (string): name string formatted as surname, comma, and forename(s)
    - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
    - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
    - 7th column: block (string): simplified name string of the name instance indicating its block membership (surname and first forename initial)
    - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

    2. 'Records' files contain lists of papers. Each paper is associated with the following columns:
    - 1st column: paper id (numeric): unique paper id; this matches the paper id (2nd column) in Signatures files
    - 2nd column: year (numeric): year of publication. Some papers may have wrong publication years due to incorrect indexing or delayed updates in the original data.
    - 3rd column: venue (string): name of the journal or conference in which the paper is published. Venue names can be in full or in a shortened format, following the formats in the original data.
    - 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline. Author names are formatted as surname, comma, and forename(s).
    - 5th column: title words (string; separated by space): words in the title of the paper. Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

    3. 'Clusters' files contain lists of clusters. Each cluster is associated with the following columns:
    - 1st column: cluster id (numeric): unique id of a cluster
    - 2nd column: list of name instance ids (Signatures, 1st column) that belong to the same unique author id (Signatures, 8th column)

    Signatures and Clusters files consist of two subsets (train and test files) of the original labeled data, which were randomly split 50%-50% by the authors of this study.

    Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

    [AMiner.zip]
    Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
    Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

    [KISTI.zip]
    Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
    Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
    Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

    [GESIS.zip]
    Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
    Note that this study reuses the 'Evaluation Set' among the original GESIS data, to which titles were added by the study below.
    Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

    [UM-IRIS.zip]
    This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
    Kim, J., Kim, J., & Owen-Smith, J. (in press). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
    For details on the labeling method and limitations, see the paper below.
    Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
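    The column layout described above maps naturally onto a small typed reader. The sketch below is a hypothetical parser for the 'Signatures' files, assuming the columns appear in the order listed and are tab-separated (the archives themselves may use a different delimiter); the `Signature` type and function name are this sketch's own, not part of the dataset.

    ```python
    # Hypothetical reader for the 'Signatures' files, assuming
    # tab-separated columns in the documented order.
    import csv
    from typing import Iterator, NamedTuple

    class Signature(NamedTuple):
        instance_id: int      # 1st column: unique name-instance id
        paper_id: int         # 2nd column: id of the containing paper
        byline_position: int  # 3rd column: position in the byline
        author_name: str      # 4th column: "surname, forename(s)"
        ethnic_group: str     # 5th column: Ethnea name ethnicity
        affiliation: str      # 6th column: affiliation, may be empty
        block: str            # 7th column: surname + first initial
        author_id: str        # 8th column: gold author label

    def read_signatures(path: str) -> Iterator[Signature]:
        """Yield one Signature per row of a signatures_*.txt file."""
        with open(path, newline='', encoding='utf-8') as f:
            for row in csv.reader(f, delimiter='\t'):
                yield Signature(int(row[0]), int(row[1]), int(row[2]),
                                row[3], row[4], row[5], row[6], row[7])
    ```

    Grouping the yielded rows by `block` reproduces the blocking scheme (surname and first forename initial) that the disambiguation experiments operate within.
    
    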

  4. Results on DDB14 and WN18RR.

    • plos.figshare.com
    xls
    Updated Apr 29, 2025
    Cite
    Chunjuan Li; Hong Zheng; Gang Liu (2025). Results on DDB14 and WN18RR. [Dataset]. http://doi.org/10.1371/journal.pone.0315782.t005
    Available download formats: xls
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Chunjuan Li; Hong Zheng; Gang Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Federated learning ensures that models can be trained globally across clients without the data leaving the local environment, making it suitable for privacy-sensitive fields such as healthcare and finance. Knowledge graph technology provides a way to express knowledge from the Internet in a form closer to the human cognitive world. Like many models, a knowledge graph embedding model requires a large amount of data for training. Data security has long been a public concern, and driven by this, knowledge graphs have begun to be combined with federated learning. However, this combination often faces the problem of statistical heterogeneity of federated data, which can degrade the performance of the trained model. Therefore, an algorithm for heterogeneous federated knowledge graphs (HFKG) is proposed to solve this problem by limiting model drift through comparative learning. In addition, during training it was found that both the server aggregation algorithm and the performance of the client knowledge graph embedding model affect the overall performance. Therefore, a new server aggregation algorithm and a new knowledge graph embedding model, RFE, are proposed. This paper uses the DDB14, WN18RR, and NELL datasets and two methods of dataset partitioning to construct data heterogeneity scenarios for extensive experiments. The experimental results show a stable improvement, demonstrating the effectiveness of the HFKG aggregation algorithm, the knowledge graph embedding model RFE, and the combined algorithm HFKG-RFE formed from the two.
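    The server-side step in this kind of setup is some form of federated averaging of client model parameters. The sketch below shows only that generic FedAvg-style baseline, weighting each client's embedding parameters by its local data size; the paper's HFKG aggregation and the RFE model are more involved, and the function name here is illustrative.

    ```python
    # Generic FedAvg-style server aggregation over client parameter
    # vectors, weighted by local dataset size. A baseline sketch only,
    # not the HFKG aggregation algorithm of the paper.
    from typing import List, Sequence

    def federated_average(client_params: Sequence[Sequence[float]],
                          client_sizes: Sequence[int]) -> List[float]:
        """Size-weighted average of per-client parameter vectors."""
        total = sum(client_sizes)
        dim = len(client_params[0])
        agg = [0.0] * dim
        for params, n in zip(client_params, client_sizes):
            w = n / total  # client weight proportional to its data
            for k in range(dim):
                agg[k] += w * params[k]
        return agg
    ```

    Under statistical heterogeneity, clients with skewed local data pull this average away from any single client's optimum (model drift), which is exactly the failure mode the paper's contrastive-style regularization is designed to limit.
    
    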

  5. Knowledge graph model rating function.

    • plos.figshare.com
    xls
    Updated Apr 29, 2025
    Cite
    Chunjuan Li; Hong Zheng; Gang Liu (2025). Knowledge graph model rating function. [Dataset]. http://doi.org/10.1371/journal.pone.0315782.t001
    Available download formats: xls
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Chunjuan Li; Hong Zheng; Gang Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the same PLOS ONE article as dataset 4 above; see the description there.

