Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of model performance (median (Q1-Q3))
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used to generate Figures 6 and 7.
Figure 6: Analysis of human breast cancer (Block A Section 1), from the 10XGenomics Visium Spatial Gene Expression 1.0.0 demonstration samples. A) SIMLR partitioning in 9 clusters (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). B) Cell stability score plot for the SIMLR clusters in A (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). C) Location of the SIMLR clusters in the tissue section (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_spatial_Stability.pdf). D) Hematoxylin and eosin image (figure6and7/HBC_BAS1/spatial/V1_Breast_Cancer_Block_A_Section_1_image.tif).
Figure 7: Information content extracted by SCA analysis using a TF-based latent space. A) QCC (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlot.pdf). B) QCM (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlotUNBIAS.pdf). C) QCM/QCC plot, where only cluster 7 shows, for the majority of the cells, both QCC and QCM greater than 0.5 (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_StabilitySignificativityJittered.pdf). D) COMET analysis of the SCA latent space. SOX5 was detected as the top-ranked gene specific for cluster 7, using the latent space frequency table as input for COMET (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/outputvis/cluster_7_singleton/rank_1.png). The input counts table for the SCA analysis consists of raw counts.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The paper presents an original application of the recently proposed spatial data mining method GraphRECAP to daily commuting flows, using 2011 Albanian census data. Its aim is to identify clusters of Albanian municipalities/communes and to propose a classification of the Albanian territory based on daily commuting flows among them. Starting from 373 local units, we first applied a spatial clustering technique without imposing any constraining strategy. Based on the input variables, we obtained 16 clusters. In the second step of our analysis, we imposed a set of constraining parameters to identify intermediate areas between the local level (municipality/commune) and the national one. We defined 12 derived regions (the same number as the actual Albanian prefectures, but with different geographies). These derived regions are quite different from the traditional ones in terms of both geographical dimensions and boundaries.
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling with the Multinomial Naïve Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the relationships between attributes, with the ability to predict new situations. The MNB model reaches 97% accuracy. The study concludes that it outperforms the GNB classifier and the RF, each of which is reported with 100% accuracy.
Methods
Prior to data collection, the researcher was guided by ethical training certification on data collection and on the right to confidentiality and privacy, under the Institutional Review Board (IRB). Data were collected from the manual archives of the hospitals, which were purposively selected using a stratified sampling technique, then transformed to electronic form and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result of the diagnosis. The data were divided into two tables: the first table, data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.
Data Source Collection
The malaria incidence data set was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis. Geographical location and socio-economic factors, where available for patients inhabiting those areas, are also taken into account. Multinomial Naive Bayes is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data shall be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one database table shall be called training set 1, while the training portion taken from another database table shall be called training set 2.
The dataset was split into two parts: a training sample containing 70% of the data and a test sample containing the remaining 30%. Using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were then tested on the remaining 30% of the data, and the results were compared with other machine learning models using standard metrics.
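The 70%/30% split and the Multinomial Naive Bayes step described above can be sketched in plain Python. This is a minimal illustration, not the study's code: the class name `TinyMultinomialNB`, the helper `split_70_30`, the Laplace smoothing value, and the toy counts are all assumptions introduced here.

```python
import math
import random
from collections import defaultdict


def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) as a 70%/30% split."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


class TinyMultinomialNB:
    """Multinomial Naive Bayes on non-negative count features (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing; the value is an assumption

    def fit(self, X, y):
        n_features = len(X[0])
        class_counts = defaultdict(int)
        feature_totals = defaultdict(lambda: [0.0] * n_features)
        for features, label in zip(X, y):
            class_counts[label] += 1
            totals = feature_totals[label]
            for j, value in enumerate(features):
                totals[j] += value
        n = len(y)
        # Log prior per class and smoothed log likelihood per (class, feature).
        self.log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
        self.log_likelihood = {}
        for c, totals in feature_totals.items():
            denom = sum(totals) + self.alpha * n_features
            self.log_likelihood[c] = [
                math.log((t + self.alpha) / denom) for t in totals
            ]
        return self

    def predict_one(self, features):
        def score(c):
            weights = self.log_likelihood[c]
            return self.log_prior[c] + sum(v * w for v, w in zip(features, weights))

        return max(self.log_prior, key=score)

    def accuracy(self, X, y):
        hits = sum(self.predict_one(f) == label for f, label in zip(X, y))
        return hits / len(y)


# Toy usage with made-up symptom counts (P = positive, N = negative).
toy_X = [[3, 0], [4, 1], [0, 3], [1, 4]]
toy_y = ["P", "P", "N", "N"]
toy_model = TinyMultinomialNB().fit(toy_X, toy_y)
```

In practice a library implementation would normally replace the hand-written class; the sketch only makes the counting and scoring steps visible.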
Classification and prediction:
Based on the nature of the variables in the dataset, this study will use the Multinomial Naïve Bayes classification technique in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as the test data set.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part shall be classified with classifier 2, as follows:
Classifier phase 1: It classifies records into positive or negative classes. A patient who has malaria is classified as positive (P), while a patient who does not have malaria is classified as negative (N).
Classifier phase 2: It classifies only the records that classifier 1 has labelled positive, assigning them to the complicated or uncomplicated class. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed such that the core parameters, as determining factors, must supply their values.
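The two-phase flow above can be sketched as a simple cascade in which phase 2 only sees records that phase 1 labelled positive. The function name, the record fields `score1` and `score2`, and the threshold rules are placeholders with no clinical meaning; in the study both phases would be trained MNB models.

```python
from typing import Callable, Mapping, Optional, Tuple

Record = Mapping[str, float]


def two_phase_classify(
    record: Record,
    is_positive: Callable[[Record], bool],
    is_complicated: Callable[[Record], bool],
) -> Tuple[str, Optional[str]]:
    """Return (phase-1 label, phase-2 label or None)."""
    if not is_positive(record):
        # Negative records never reach the second classifier.
        return ("N", None)
    severity = "complicated" if is_complicated(record) else "uncomplicated"
    return ("P", severity)


# Hypothetical stand-ins for the two trained classifiers.
def phase1(record: Record) -> bool:
    return record["score1"] >= 0.5


def phase2(record: Record) -> bool:
    return record["score2"] >= 0.5


example = two_phase_classify({"score1": 0.9, "score2": 0.2}, phase1, phase2)
```

Passing the classifiers in as callables keeps the cascade logic separate from whichever models are eventually trained for each phase.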
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison of the three classifiers on the CHD data.
http://opendatacommons.org/licenses/dbcl/1.0/
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it on Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down if requested.)
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Aside from regression modelling, an interesting exercise is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. classifying wines with a quality of 7 or higher as 'good/1' and the remainder as 'not good/0'. This lets you practice hyperparameter tuning on, e.g., decision tree algorithms while looking at the ROC curve and the AUC value. Without any feature engineering or overfitting you should be able to get an AUC of .88 (without even using a random forest algorithm).
KNIME is a great tool (GUI) that can be used for this.
1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram for basic EDA.
2 - File Reader to 'Rule Engine Node' to turn the 10-point scale into a dichotomous variable (good wine and rest). The code to put in the rule engine is something like this:
- $quality$ > 6.5 => "good"
- TRUE => "bad"
3 - Rule Engine Node output to input of Column Filter Node to filter out the original 10-point feature (this prevents leakage)
4 - Column Filter Node output to input of Partitioning Node (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified')
5 - Partitioning Node train data split output to input of Decision Tree Learner Node
6 - Partitioning Node test data split output to input of Decision Tree Predictor Node
7 - Decision Tree Learner Node output to input of Decision Tree Predictor Node
8 - Decision Tree Predictor Node output to input of ROC Node (here you can evaluate your model based on the AUC value)
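For readers who prefer code to a GUI, the labelling rule from step 2 and the column filter from step 3 can be sketched in Python, together with a small AUC helper. The function names `label_wine` and `auc` are introduced here for illustration; only the `quality` column name and the 6.5 cutoff come from the rule above.

```python
def label_wine(row, cutoff=6.5):
    """Add a 'good'/'bad' label and drop the original quality score.

    Dropping 'quality' mirrors the Column Filter step, so the score used
    to build the label cannot leak into the model as a feature.
    """
    labelled = {key: value for key, value in row.items() if key != "quality"}
    labelled["label"] = "good" if float(row["quality"]) > cutoff else "bad"
    return labelled


def auc(labels, scores, positive="good"):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts how often a positive example is scored above a negative one;
    ties count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == positive]
    neg = [s for label, s in zip(labels, scores) if label != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up rows for illustration only.
rows = [{"alcohol": "12.8", "quality": "7"}, {"alcohol": "9.4", "quality": "5"}]
labelled_rows = [label_wine(row) for row in rows]
```

The pairwise AUC is quadratic in the number of rows, which is fine for a dataset of this size but not for very large ones.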
Use machine learning to determine which physicochemical properties make a wine 'good'!
I am not the owner of this dataset.
Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For reliable data treatment and interpretation, qualitative descriptors were omitted and only numerical clinical indicators were included in the data matrix. The final data set dimension was [729 x 18].
The data were treated by hierarchical cluster analysis and factor analysis. The major goal of the data mining was to reach a statistically significant partitioning of the objects and variables into similarity patterns (clusters). These patterns help to better understand the data structure and to assess the meaning of the partitioning achieved, thus supporting the evaluation of the health status of the patients and of the role of specific descriptors in forming the partitioning patterns.
3D classification Python tool.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of nine indicator variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the soil attribute products of the Tasmanian Soil Attribute Grids. There are 8 soil attribute products available from the TERN Soil Facility. Each soil attribute product is a collection of 6 depth slices. Each depth raster has an upper and lower uncertainty limit raster associated with it. The depths provided are 0-5cm, 5-15cm, 15-30cm, 30-60cm, 60-100cm & 100-200cm, consistent with the Specifications of the GlobalSoilMap.
Attributes: pH - Water (pHw); Electrical Conductivity dS/m (ECD); Clay % (CLY); Sand % (SND); Silt % (SLT); Bulk Density - Whole Earth Mg/m3 (BDw); Organic Carbon % (SOC); Coarse Fragments >2mm (CFG).
These products were developed using datasets held by the Tasmanian Department of Primary Industries Parks Water & Environment (DPIPWE) Soils Database. The mapping was made by using spatial modelling and digital soil mapping (DSM) techniques to produce a fine resolution 3 arc-second grid of soil attribute values and their uncertainties, across all of Tasmania.
Note: Previous versions of this collection contained a Depth layer. This has been removed as the units do not comply with Global Soil Map specifications.
Lineage: The soil attribute maps are generated using spatial modelling and digital soil mapping techniques.
Soil inventory:
Tasmanian soil site data originates from the DPIPWE soils database, a compilation of various historical soil surveys undertaken by DPIPWE, CSIRO, Forestry Tasmania and the University of Tasmania. This database contains morphological and laboratory data for all the soil sites.
Data Modelling:
A raster stack of all covariates was generated and the target variable (each soil property and depth) individually intersected with the covariate values to provide the calibration and validation data. All modelling was undertaken in ‘R’ (R Development Core Team 2012), using Regression tree (RT), specifically the Cubist R package (Kuhn, Weston et al. 2012; Kuhn, Weston et al. 2013; Quinlan 2005). The RT approach is a popular modelling approach for many disciplines (Breiman, Friedman et al. 1984), and has been widely used with DSM (Grunwald 2009; Kidd, Malone et al. 2014; McKenzie and Ryan 1999). Cubist develops the regression trees by first applying a data mining-approach to partition the calibration and explanatory covariate values into a set of structured ‘classifier’ data. The tree structure is developed by repeatedly partitioning the data into linear models until no significant measure of difference in the calibration data is determined (McBratney, Mendonça Santos et al. 2003). A series of covariate-based rules (conditions) is developed, and the linear model corresponding to the covariate conditions is applied to produce the final modelled surface. For this modelling exercise, the number of rules was set within the model controls to let the Cubist algorithm decide upon the optimum number of rules to generate.
Uncertainty:
Leave-one-out cross-validation (LOOCV) was applied to the Cubist model to generate rule-based uncertainties, using only those covariates forming the conditional partitioning of that rule, following Malone et al (2014). The LOOCV, applied to an individual Cubist model for each rule, effectively produced a mean value for each RT partition, with the 5% and 95% quantiles of the prediction variation providing the lower and upper prediction uncertainty values respectively, at the 90% Prediction Interval (PI). A 10-fold cross-validation was used to run this process 10 times across all data to produce mean modelling diagnostics and validations, and to reduce modelling bias due to sensitivity to training data variance.
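The idea of deriving a 90% prediction interval from the 5% and 95% quantiles of leave-one-out predictions can be illustrated with a short Python sketch. It is a simplified stand-in, not the Cubist workflow in R: the rule's linear model is replaced by a plain mean, the quantile interpolation method is an assumption, and the observation values are made up.

```python
import statistics


def loo_mean_predictions(values):
    """Leave each observation out in turn and predict it with the mean of the rest.

    A mean-only model is used here purely for illustration; the original
    work fits a Cubist linear model within each rule partition.
    """
    total = sum(values)
    n = len(values)
    return [(total - value) / (n - 1) for value in values]


def prediction_interval_90(predictions):
    """Return (lower, mean, upper) from the 5% and 95% quantiles."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; first and last bound the 90% PI.
    cuts = statistics.quantiles(predictions, n=20, method="inclusive")
    return cuts[0], statistics.fmean(predictions), cuts[-1]


# Made-up observations falling in a single rule partition.
observations = [4.8, 5.1, 5.6, 6.0, 6.3, 5.9, 5.2]
lower, mean, upper = prediction_interval_90(loo_mean_predictions(observations))
```

The two quantiles leave 5% of the prediction variation on each side, which is why the resulting band corresponds to a 90% interval.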
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is used for studying the space of optimal solutions of the Correlation Clustering problem. It contains both complete and incomplete signed networks, as well as their spaces of optimal solutions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data used in the experiment of the paper submitted to the following conference:
N. Arinik, R. Figueiredo, V. Labatut, Signed graph analysis for the interpretation of voting behavior, in: International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities, Graz, AT, 2017.
URL http://ceur-ws.org/Vol-2025/paper_rssna_1.pdf
The code source is accessible here: https://github.com/CompNet/NetVotes

# RAW INPUT FILES
The 'itsyourparliament' folder contains all raw input files for further data processing (such as network extraction). The folder structure is as follows:
* itsyourparliament/
** domains: There are 28 domain files. Each file corresponds to a domain (such as Agriculture, Economy, etc.) and contains the corresponding vote identifiers and their "itsyourparliament.eu" links.
** meps: There are 870 Member of Parliament (MEP) files. Each file contains the MEP information (such as name, country, address, etc.)
** votes: There are 7513 vote files. Each file contains the votes expressed by MEPs.

# NETWORKS AND CORRESPONDING PARTITIONS
This work studies the voting behavior of French and Italian MEPs on "Agriculture and Rural Development" (AGRI) and "Economic and Monetary Affairs" (ECON) for each separate year of the 7th EP term (2009-10, 2010-11, 2011-12, 2012-13, 2013-14). Note that the interpretation part (section 4) of the published paper is limited to only a few of these instances (2009-10 in ECON and 2012-13 in AGRI).
The extracted networks are located in the "networks" folder and the corresponding partitions are in the "partitions" folder. Both folders have the same structure, as follows:
COUNTRY-NAME
|_DOMAIN-NAME
   |_2009-10
   |_2010-11
   |_2011-12
   |_2012-13
   |_2013-14

## NETWORKS
The networks in this folder are used in the article. All of them are the networks obtained after the filtering step (as explained in the article). The networks are in 'Graphml' format. They are enriched with some MEP properties (such as name, political party, etc.) associated with each node.

## ALL NETWORKS
For those who are interested in other countries or domains, we make available all the networks that can be extracted from the raw data, with and without the filtering step.
COUNTRY-NAME
|_m3
   |_negtr=NA_postr=NA: This folder contains all filtered networks. Note that the filtering step is explained in Section 2.1.2 of the article.
      |_bygroup
      |_bycountry
   |_negtr=0_postr=0: This folder contains all original networks (i.e. no filtering step).
      |_bygroup
      |_bycountry

## PARTITIONS
The partitions are obtained as follows. First, the Ex-CC (exact) method is run, and 'k' denotes the number of clusters detected in its output. This 'k' value is the reference point for running the ILS-RCC (heuristic) method, which requires the desired number of output clusters to be specified. Then, ILS-RCC is run with various values ('k', 'k+1', 'k+2'). All those results are integrated into the initial network graphml files and then converted into gephi format, so that the results can be explored interactively.
Note that absent MEPs need to be handled in the clustering results, because those MEPs correspond to isolated nodes in the networks. Each isolated node is considered a single-node cluster in the Ex-CC results. We simply omit those nodes in order to find the 'k' value (number of detected clusters) before running ILS-RCC. Note also that ILS-RCC does not handle isolated nodes separately, so an isolated node can be part of a cluster.

# COMPARISON RESULTS
The 'material-stats' folder contains all the comparison results obtained for Ex-CC and ILS-CC. The csv files associated with the plots are also provided. The folder structure is as follows:
* material-stats/
** execTimePerf: The plot shows the execution time of Ex-CC and ILS-CC based on randomly generated complete networks of different sizes.
** graphStructureAnalysis: The plots show the weight and link statistics for all instances.
** ILS-CC-vs-Ex-CC: The folder contains 4 different comparisons between Ex-CC and ILS-CC: imbalance difference, number of detected clusters, difference in the number of detected clusters, and NMI (Normalized Mutual Information).
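The way the reference value 'k' is derived before running the heuristic can be sketched in Python. This is an illustration of the procedure described in the PARTITIONS section, not code from the NetVotes repository; the function names and the toy MEP identifiers are hypothetical.

```python
def isolated_nodes(nodes, edges):
    """Return the nodes that appear in no edge (e.g. absent MEPs)."""
    connected = set()
    for u, v in edges:
        connected.add(u)
        connected.add(v)
    return set(nodes) - connected


def reference_k(membership, isolated):
    """Count clusters of an exact solution, ignoring isolated nodes.

    membership maps each node to its cluster id. Isolated nodes form
    single-node clusters in the exact result and are omitted here.
    """
    return len({cluster for node, cluster in membership.items() if node not in isolated})


def heuristic_cluster_counts(k):
    """Cluster counts for which the heuristic is then run: k, k+1, k+2."""
    return [k, k + 1, k + 2]


# Toy example: node "d" has no links, so its singleton cluster is not counted.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c")]
toy_membership = {"a": 0, "b": 0, "c": 1, "d": 2}
toy_k = reference_k(toy_membership, isolated_nodes(toy_nodes, toy_edges))
```

Counting distinct cluster ids after filtering, instead of subtracting the number of isolated nodes, stays correct even if an isolated node were ever assigned to a shared cluster.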
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.
Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with information as follows.
- 1st column: instance id (numeric): unique id assigned to a name instance
- 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
- 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
- 4th column: author name (string): name string formatted as surname, comma, and forename(s)
- 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
- 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
- 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
- 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

2. 'Records' files contain lists of papers. Each paper is associated with information as follows.
- 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
- 2nd column: year (numeric): year of publication
  * Some papers may have wrong publication years due to incorrect indexing or delayed updates in original data
- 3rd column: venue (string): name of journal or conference in which the paper is published
  * Venue names can be in full string or in a shortened format according to the formats in original data
- 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline
  * Author names are formatted into surname, comma, and forename(s)
- 5th column: title words (string; separated by space): words in a title of the paper.
  * Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

3. 'Clusters' files contain lists of clusters. Each cluster is associated with information as follows.
- 1st column: cluster id (numeric): unique id of a cluster
- 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column).

Signatures and Clusters files consist of two subsets, train and test files, of the original labeled data, which were randomly split 50%-50% by the authors of this study.
Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

[AMiner.zip]
Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

[KISTI.zip]
Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

[GESIS.zip]
Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th international Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
Note that this study reuses the 'Evaluation Set' among the original GESIS data, to which titles were added by the study below.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

[UM-IRIS.zip]
This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459.
For details on the labeling method and limitations, see the paper below.
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
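The relationship between the Signatures and Clusters files (name instance ids grouped by author id) can be sketched in Python. The column order follows the description above, but the tab delimiter, the field names, and the sample rows are assumptions made for illustration; the actual files may use a different separator.

```python
import csv
import io
from collections import defaultdict

# Field names are labels chosen here; the order follows the 8 documented columns.
SIGNATURE_FIELDS = [
    "instance_id", "paper_id", "byline_position", "author_name",
    "ethnic_name_group", "affiliation", "block", "author_id",
]


def read_signatures(handle, delimiter="\t"):
    """Parse signature rows into dicts (the delimiter is an assumption)."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [dict(zip(SIGNATURE_FIELDS, row)) for row in reader if row]


def group_by_author(signatures):
    """Group instance ids (1st column) by author id (8th column)."""
    clusters = defaultdict(list)
    for signature in signatures:
        clusters[signature["author_id"]].append(signature["instance_id"])
    return dict(clusters)


# Made-up rows for illustration only.
sample = io.StringIO(
    "1\t10\t1\tDoe, Jane\tENGLISH\t\tdoe j\tA1\n"
    "2\t11\t2\tDoe, J.\tENGLISH\tSome University\tdoe j\tA1\n"
    "3\t11\t1\tRoe, John\tENGLISH\t\troe j\tA2\n"
)
clusters = group_by_author(read_signatures(sample))
```

Quoting is disabled in the reader so that quotation marks inside names or affiliations are kept as literal characters.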
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data used in the experiments of our paper:
N. Arinik, R. Figueiredo, V. Labatut (2020), Multiplicity and Diversity: Analyzing the Optimal Solution Space of the Correlation Clustering Problem on Complete Signed Graphs, Journal of Complex Networks, DOI: 10.1093/comnet/cnaa025.
The code source is accessible here: https://github.com/CompNet/Sosocc

This dataset contains:
* Plot files used in the article
* Input signed networks
* All optimal solutions (i.e. optimal solution space) of the corresponding networks
* Evaluation files

# PLOT FILES
* Figure1.zip: Figures showing that there might be many distinct optimal solutions of a small-sized network.
* Figure2.zip: Figures showing that distinct optimal solutions of a given network might be partition-wise very similar or different.
* Figure4: All Results.zip: Because of space constraints, Figure 4 in the article contains only a few of the result plots. This zip file contains all plots, and it is organized by the values of l0. In each l0 folder, the results are shown from three different perspectives:
--- Detected Imbalance Percentage vs Graph Order (i.e. number of vertices)
--- Prop mispl vs Graph order
--- Graph order vs Prop mispl
* workflow.pdf: The workflow of the methodology used in the article.
* Syrian network With All Solutions.pdf: Syrian network (on top) with core part information through node colors, and its optimal solutions in which node colors represent partition information (on bottom).

# NETWORKS
All networks are in Input Signed Networks.tar.gz.
Networks are generated through a simple random model (available in https://github.com/CompNet/SignedBenchmark) designed to produce complete (or incomplete) unweighted networks with built-in modular structure. There are 3 parameters used for the generation:
- number of nodes (n)
- initial number of modules (l0)
- proportion of misplaced links, i.e. proportion of frustrated links, (qm)

Inside Input Signed Networks.tar.gz:
NETWORKS
|_n=NB-NODE_l0=INIT_NB_MODULE_dens=1.0000
....|_propMispl=PROP_MISPL
........|_propNeg=PROP_NEG
............|_network=NETWORK_NO
- The first hierarchy => the folders are named as follows: n=NB-NODE_l0=INIT-NB-MODULE_dens=1.0000
  The number of nodes, the initial number of modules and the network density are given. The network density is always 1, since we treat only complete signed networks.
- The second hierarchy => the folders are named as follows: propMispl=PROP_MISPL
  The proportion of misplaced links is given.
- The third hierarchy => the folders are named as follows: propNeg=PROP_NEG
  The proportion of negative links (qn) is specified. qn changes depending on n and l0. Since only complete signed networks are studied, this parameter is automatically computed from the other input parameters.
- The fourth hierarchy => the folders are named as follows: network=NETWORK_NO
  The network numbers are shown.
In the end, there are three file formats describing the same network content: GraphML (.graphml), Pajek NET (.net) or .G format.

# PARTITIONS
All partition results are in Partition Results.tar.gz. Note that all optimal partitions of a signed network are obtained through an exact partitioning method. The code source is accessible here: https://github.com/arinik9/ExCC

Inside Partition Results.tar.gz:
PARTITIONS
|_n=NB-NODE_l0=INIT_NB_MODULE_dens=1.0000
....|_propMispl=PROP_MISPL
........|_propNeg=PROP_NEG
............|_network=NETWORK_NO
................|_"ExCC-all"
....................|_"signed-unweighted"
- The first hierarchy => the folders are named as follows: n=NB-NODE_l0=INIT-NB-MODULE_dens=1.0000
- The second hierarchy => the folders are named as follows: propMispl=PROP_MISPL
- The third hierarchy => the folders are named as follows: propNeg=PROP_NEG
- The fourth hierarchy => the folders are named as follows: network=NETWORK_NO
- The fifth hierarchy => the folders are named as follows: "ExCC-all"
  The name of the partitioning method is shown. Since an exact partitioning method is used to obtain all distinct optimal solutions, it is named "ExCC-all".
- The sixth hierarchy => the folders are named as follows: "signed-unweighted"
  The type of signed network is shown: signed and unweighted.
In the end, the partition results are located, and the files are named as follows: membership.txt. Note that the first partition result number starts from zero.

# EVALUATIONS
Evaluation results related to our plots are in Evaluation Results.tar.gz. Note that the hierarchy of this folder is the same as that of 'Partitions'.

Inside Evaluation Results.tar.gz:
- Best-k-for-kmedoids.csv: It contains three columns: 1) the number of solution classes via kmedoids, 2) the best Silhouette score, 3) the best clustering in terms of Silhouette score, which represents solution classes.
- class-core-part-size-tresh=1.00.csv: It indicates the proportion of core part size for each solution class.
- exec-time.csv: It indicates the execution time in seconds.
- imbalance.csv: It contains the imbalance information as 1) count and 2) percentage.
- nb-solution.csv: It indicates the total number of solutions.
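The imbalance reported as a count and a percentage can be illustrated with a short Python sketch for unweighted signed networks. It uses the usual Correlation Clustering notion of a misplaced link (a positive link between clusters or a negative link inside a cluster). The function name, the edge representation, and the choice to express the percentage relative to the total number of links are assumptions, not taken from the ExCC code.

```python
def imbalance(edges, membership):
    """Return (count, percentage) of misplaced links for one partition.

    edges: iterable of (u, v, sign) with sign +1 or -1 (unweighted network).
    membership: dict mapping each node to its cluster id.
    """
    misplaced = 0
    total = 0
    for u, v, sign in edges:
        total += 1
        same_cluster = membership[u] == membership[v]
        # Positive links should stay inside clusters, negative links between them.
        if (sign > 0 and not same_cluster) or (sign < 0 and same_cluster):
            misplaced += 1
    percentage = 100.0 * misplaced / total if total else 0.0
    return misplaced, percentage


# Toy complete signed network on three nodes.
toy_edges = [("a", "b", +1), ("a", "c", -1), ("b", "c", +1)]
count_split, pct_split = imbalance(toy_edges, {"a": 0, "b": 0, "c": 1})
count_single, pct_single = imbalance(toy_edges, {"a": 0, "b": 0, "c": 0})
```

In this toy network both partitions shown have one misplaced link, and no partition of the three nodes does better, which is a small instance of the situation the dataset studies: several distinct optimal solutions for the same network.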
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of PT/INR results analyzed in different age groups.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of model performance (median (Q1-Q3))