14 datasets found
  1. Summary of model performance (median (Q1-Q3))

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 1, 2023
    Cite
    Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Summary of model performance (median (Q1-Q3)) [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of model performance (median (Q1-Q3))

  2. Figure 6 and 7 from manuscript Sparsely-Connected Autoencoder (SCA) for...

    • figshare.com
    zip
    Updated Aug 26, 2020
    Cite
    Raffaele Calogero (2020). Figure 6 and 7 from manuscript Sparsely-Connected Autoencoder (SCA) for single cell RNAseq data mining [Dataset]. http://doi.org/10.6084/m9.figshare.12866897.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Raffaele Calogero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used to generate Figures 6 and 7.

    Figure 6: Analysis of human breast cancer (Block A Section 1), from 10XGenomics Visium Spatial Gene Expression 1.0.0 demonstration samples.
    A) SIMLR partitioning in 9 clusters (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf).
    B) Cell stability score plot for SIMLR clusters in A (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf).
    C) Location of SIMLR clusters in the tissue section (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_spatial_Stability.pdf).
    D) Hematoxylin and eosin image (figure6and7/HBC_BAS1/spatial/V1_Breast_Cancer_Block_A_Section_1_image.tif).

    Figure 7: Information content extracted by SCA analysis using a TF-based latent space.
    A) QCC (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlot.pdf).
    B) QCM (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlotUNBIAS.pdf).
    C) QCM/QCC plot, where only cluster 7 shows, for the majority of the cells, both QCC and QCM greater than 0.5 (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_StabilitySignificativityJittered.pdf).
    D) COMET analysis of the SCA latent space. SOX5 was detected as the top-ranked gene specific for cluster 7, using the latent space frequency table as input for COMET (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/outputvis/cluster_7_singleton/rank_1.png).

    The input counts table for the SCA analysis consists of raw counts.

  3. Data from: Graph Regionalization with Clustering and Partitioning: an...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Sep 23, 2015
    Cite
    BENASSI FEDERICO (2015). Graph Regionalization with Clustering and Partitioning: an Application for Daily Commuting Flows in Albania [Dataset]. http://doi.org/10.7910/DVN/3AVOGY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 23, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    BENASSI FEDERICO
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Albania
    Description

    The paper presents an original application of the recently proposed spatial data mining method GraphRECAP to daily commuting flows, using 2011 Albanian census data. Its aim is to identify clusters of Albanian municipalities/communes and to propose a classification of the Albanian territory based on daily commuting flows among municipalities/communes. Starting from 373 local units, we first applied a spatial clustering technique without imposing any constraining strategy. Based on the input variables, we obtained 16 clusters. In the second step of our analysis, we imposed a set of constraining parameters to identify intermediate areas between the local level (municipality/commune) and the national one. We defined 12 derived regions (the same number as the actual Albanian prefectures, but with different geographies). These derived regions differ considerably from the traditional ones in terms of both geographical dimensions and boundaries.

  4. Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Nov 10, 2023
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling with the multinomial Naïve Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the relationships between attributes, with the ability to predict new situations. The MNB model has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

    Methods

    Prior to data collection, the researcher was guided by ethical training certification on data collection and on the right to confidentiality and privacy, in line with Institutional Review Board (IRB) requirements. Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result from diagnosis. The data were divided into two tables: the first, data1, contains data for use in phase 1 of the classification, while the second, data2, contains data for use in phase 2 of the classification.

    Data Source Collection: The malaria incidence data set was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis. The geographical location and socio-economic factors available for patients inhabiting those areas were also taken into account. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.

    Data Preprocessing: Data preprocessing shall be done to remove noise and outliers.

    Transformation: The data shall be transformed from analog to electronic records.

    Data Partitioning: The collected data will be divided into two portions; one portion shall be extracted as a training set, while the other will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2. The dataset was split into two parts: 70% for training and 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using the standard metrics.

    Classification and prediction: Based on the nature of the variables in the dataset, this study will use Naïve Bayes (Multinomial) classification techniques in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
    i. Data collection and preprocessing shall be done.
    ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
    iii. The test data set shall be stored in the database test data set.
    iv. Part of the test data set is classified using classifier 1 and the remaining part is classified with classifier 2, as follows:

    Classifier phase 1: It classifies into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while a patient who does not have malaria is classified as negative (N).
    Classifier phase 2: It classifies only the data that have been classified as positive by classifier 1, and further classifies them into the complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, cultural and socio-economic variables. The system will be designed such that the core parameters, as determining factors, supply their values.
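
    The two-phase scheme described above can be sketched as a simple cascade: phase 1 decides positive (P) or negative (N), and only positives are passed to phase 2 for grading. This is a minimal illustration, not the authors' implementation. The two rule functions are placeholder stand-ins for trained MNB models, and the feature names `parasite_count` and `severe_signs` are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]

def two_phase_classify(
    records: List[Record],
    phase1: Callable[[Record], str],
    phase2: Callable[[Record], str],
) -> List[str]:
    """Run phase 1 on every record; run phase 2 only on phase-1 positives."""
    labels = []
    for record in records:
        if phase1(record) == "N":
            labels.append("negative")
        else:
            # Only records classified as positive reach the grading step
            labels.append(phase2(record))
    return labels

# Placeholder rules standing in for trained classifiers (assumptions)
def toy_phase1(record: Record) -> str:
    return "P" if record["parasite_count"] > 0 else "N"

def toy_phase2(record: Record) -> str:
    return "complicated" if record["severe_signs"] >= 1 else "uncomplicated"

toy_records = [
    {"parasite_count": 0, "severe_signs": 0},
    {"parasite_count": 120, "severe_signs": 0},
    {"parasite_count": 900, "severe_signs": 2},
]
result = two_phase_classify(toy_records, toy_phase1, toy_phase2)
print(result)
```

    Keeping the two classifiers as separate callables mirrors the description, where each phase is trained on its own table.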

  5. Performance comparison of the three classifiers on the CHD data.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Performance comparison of the three classifiers on the CHD data. [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of the three classifiers on the CHD data.

  6. Data from: Red Wine Quality

    • kaggle.com
    zip
    Updated Nov 27, 2017
    Cite
    UCI Machine Learning (2017). Red Wine Quality [Dataset]. https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
    Explore at:
    Available download formats: zip (26176 bytes)
    Dataset updated
    Nov 27, 2017
    Dataset authored and provided by
    UCI Machine Learning
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

    This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality; I just shared it on Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down if requested.)

    Content

    For more information, read [Cortez et al., 2009].
    Input variables (based on physicochemical tests):
    1 - fixed acidity
    2 - volatile acidity
    3 - citric acid
    4 - residual sugar
    5 - chlorides
    6 - free sulfur dioxide
    7 - total sulfur dioxide
    8 - density
    9 - pH
    10 - sulphates
    11 - alcohol
    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

    Tips

    Aside from regression modelling, an interesting thing to do is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. classifying wines with a score of 7 or higher as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value. Without any feature engineering or overfitting you should be able to get an AUC of .88 (without even using a random forest algorithm).

    KNIME is a great tool (GUI) that can be used for this.
    1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram for basic EDA.
    2 - File Reader to Rule Engine node to turn the 10-point scale into a dichotomous variable (good wine and the rest); the code to put in the rule engine is something like this:
    - $quality$ > 6.5 => "good"
    - TRUE => "bad"
    3 - Rule Engine node output to input of Column Filter node to filter out the original 10-point feature (this prevents leakage).
    4 - Column Filter node output to input of Partitioning node (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified').
    5 - Partitioning node train split output to input of Decision Tree Learner node.
    6 - Partitioning node test split output to input of Decision Tree Predictor node.
    7 - Decision Tree Learner node output to input of Decision Tree Predictor node.
    8 - Decision Tree Predictor output to input of ROC node (here you can evaluate your model based on the AUC value).
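
    For readers without KNIME, the data-preparation part of the workflow above (steps 2 to 4) can be sketched in plain Python. This is an illustrative sketch only: the toy rows are made up, the function names are hypothetical, and the tree learner and ROC steps are left out. The rule `quality > 6.5` and the 75%/25% stratified split follow the tips above.

```python
import random
from collections import defaultdict

def binarize_quality(rows, cutoff=6.5):
    """Step 2 and 3: add a 'good'/'bad' label and drop the original score."""
    labelled = []
    for row in rows:
        # Dropping 'quality' prevents the target from leaking into the features
        features = {k: v for k, v in row.items() if k != "quality"}
        features["label"] = "good" if row["quality"] > cutoff else "bad"
        labelled.append(features)
    return labelled

def stratified_split(rows, train_fraction=0.75, seed=0):
    """Step 4: split each label group separately so class ratios are kept."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# Toy rows (made-up values, only two of the twelve columns)
scores = [5, 6, 7, 5, 8, 6, 5, 7, 6, 5, 7, 8]
wines = [{"alcohol": 9.0 + 0.1 * i, "quality": q} for i, q in enumerate(scores)]

labelled = binarize_quality(wines)
train, test = stratified_split(labelled)
print(len(train), len(test))
```

    The resulting `train` and `test` lists could then be passed to any decision tree implementation and scored with an AUC routine.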

    Inspiration

    Use machine learning to determine which physicochemical properties make a wine 'good'!

    Acknowledgements

    This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality; I just shared it on Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down at first request. I am not the owner of this dataset.)

    Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

  7. The input data set includes 729 objects (patients) and 39 variables...

    • zenodo.org
    csv, png +1
    Updated Jul 16, 2024
    Cite
    Miroslava Nedyalkova; Miroslava Nedyalkova (2024). The input data set includes 729 objects (patients) and 39 variables (clinical qualitative and quantitative descriptors). [Dataset]. http://doi.org/10.5281/zenodo.6652207
    Explore at:
    Available download formats: png, csv, text/x-python
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Miroslava Nedyalkova; Miroslava Nedyalkova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For reliable data treatment and interpretation, qualitative descriptors were omitted and only numerical clinical indicators were included in the data matrix. The final data set dimension was [729 x 18].

    The data were treated by hierarchical cluster analysis and factor analysis. The major goal of the data mining was to reach a statistically significant partitioning of the objects and variables into similarity patterns (clusters). Such a partitioning helps to better understand the data structure and to assess the meaning of the partitioning achieved, thus promoting the evaluation of the health status of the patients and of the role of specific descriptors in forming the partitioning patterns.

    3D classification Python tool.

  8. Description of nine indicator variables.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Description of nine indicator variables. [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of nine indicator variables.

  9. Soil and Landscape Grid Digital Soil Property Maps for Tasmania (3"...

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Nov 24, 2022
    Cite
    Alex McBratney; Budiman Minasny; Brendan Malone; Mathew Webb; Darren Kidd (2022). Soil and Landscape Grid Digital Soil Property Maps for Tasmania (3" resolution) [Dataset]. http://doi.org/10.4225/08/5AAF364C54CC8
    Explore at:
    Available download formats: datadownload
    Dataset updated
    Nov 24, 2022
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Alex McBratney; Budiman Minasny; Brendan Malone; Mathew Webb; Darren Kidd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1947 - Sep 30, 2014
    Description

    These are the soil attribute products of the Tasmanian Soil Attribute Grids. There are 8 soil attribute products available from the TERN Soil Facility. Each soil attribute product is a collection of 6 depth slices. Each depth raster has an upper and lower uncertainty limit raster associated with it. The depths provided are 0-5cm, 5-15cm, 15-30cm, 30-60cm, 60-100cm & 100-200cm, consistent with the Specifications of the GlobalSoilMap.

    Attributes: pH - Water (pHw); Electrical Conductivity dS/m (ECD); Clay % (CLY); Sand % (SND); Silt % (SLT); Bulk Density - Whole Earth Mg/m3 (BDw); Organic Carbon % (SOC); Coarse Fragments >2mm (CFG).

    These products were developed using datasets held by the Tasmanian Department of Primary Industries Parks Water & Environment (DPIPWE) Soils Database. The mapping was made by using spatial modelling and digital soil mapping (DSM) techniques to produce a fine resolution 3 arc-second grid of soil attribute values and their uncertainties, across all of Tasmania.

    Note: Previous versions of this collection contained a Depth layer. This has been removed as the units do not comply with Global Soil Map specifications.

    Lineage: The soil attribute maps are generated using spatial modelling and digital soil mapping techniques.

    Soil inventory:

    Tasmanian soil site data originates from the DPIPWE soils database, a compilation of various historical soil surveys undertaken by DPIPWE, CSIRO, Forestry Tasmania and the University of Tasmania. This database contains morphological and laboratory data for all the soil sites.

    Data modelling:

    A raster stack of all covariates was generated and the target variable (each soil property and depth) individually intersected with the covariate values to provide the calibration and validation data. All modelling was undertaken in ‘R’ (R Development Core Team 2012), using regression trees (RT), specifically the Cubist R package (Kuhn, Weston et al. 2012; Kuhn, Weston et al. 2013; Quinlan 2005). The RT approach is a popular modelling approach in many disciplines (Breiman, Friedman et al. 1984), and has been widely used with DSM (Grunwald 2009; Kidd, Malone et al. 2014; McKenzie and Ryan 1999). Cubist develops the regression trees by first applying a data-mining approach to partition the calibration and explanatory covariate values into a set of structured ‘classifier’ data. The tree structure is developed by repeatedly partitioning the data into linear models until no significant measure of difference in the calibration data is determined (McBratney, Mendonça Santos et al. 2003). A series of covariate-based rules (conditions) is developed, and the linear model corresponding to the covariate conditions is applied to produce the final modelled surface. For this modelling exercise, the number of rules was set within the model controls to let the Cubist algorithm decide upon the optimum number of rules to generate.

    Uncertainty:

    Leave-one-out cross-validation (LOOCV) was applied to the Cubist model to generate rule-based uncertainties, using only those covariates forming the conditional partitioning of that rule, following Malone et al (2014). The LOOCV, applied to an individual Cubist model for each rule, effectively produced a mean value for each RT partition, with the 5% and 95% quantiles of the prediction variation providing the lower and upper prediction uncertainty values respectively, at the 90% Prediction Interval (PI). A 10-fold cross-validation was used to run this process 10 times across all data to produce mean modelling diagnostics and validations, and to reduce modelling bias due to sensitivity to training data variance.
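
    As a rough illustration of the interval described above, the lower and upper limits of a 90% prediction interval can be read off as the empirical 5% and 95% quantiles of a set of predictions. The sketch below is not the Cubist workflow itself: the function name is hypothetical and the input values are toy numbers standing in for the leave-one-out predictions of a single rule partition.

```python
import statistics

def prediction_interval_90(loocv_predictions):
    """Return (lower, mean, upper) from the empirical 5% and 95% quantiles."""
    # quantiles(n=20) yields 19 cut points at 5%, 10%, ..., 95%
    cuts = statistics.quantiles(loocv_predictions, n=20, method="inclusive")
    mean_value = statistics.fmean(loocv_predictions)
    return cuts[0], mean_value, cuts[-1]

# Toy values (assumption): 101 evenly spaced predictions from 0 to 100
toy_predictions = [float(v) for v in range(101)]
lower, mean_value, upper = prediction_interval_90(toy_predictions)
print(lower, mean_value, upper)
```

    In the dataset, each depth raster is paired with such a lower and an upper limit raster.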

  10. Learning optimal solution characteristics of the Correlation Clustering...

    • figshare.com
    application/gzip
    Updated Mar 14, 2022
    Cite
    Nejat Arinik; Vincent Labatut (2022). Learning optimal solution characteristics of the Correlation Clustering problem [Dataset]. http://doi.org/10.6084/m9.figshare.19350284.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 14, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nejat Arinik; Vincent Labatut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is used for studying the space of optimal solutions of the Correlation Clustering problem. It contains both complete and incomplete signed networks, as well as their spaces of optimal solutions.
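
    For context, the Correlation Clustering problem is commonly stated as finding a partition of a signed network that minimizes the imbalance: the total weight of negative edges inside clusters plus positive edges between clusters. The sketch below computes that quantity for a given partition. The edge-list and membership formats are assumptions made for illustration and are not the file formats used in this dataset.

```python
def imbalance(edges, membership):
    """Sum the weights of edges that disagree with the partition.

    edges: iterable of (u, v, weight); weight > 0 is a positive edge,
           weight < 0 is a negative edge.
    membership: dict mapping each node to its cluster id.
    """
    total = 0.0
    for u, v, weight in edges:
        same_cluster = membership[u] == membership[v]
        if same_cluster and weight < 0:
            total += -weight      # negative edge inside a cluster
        elif not same_cluster and weight > 0:
            total += weight       # positive edge between clusters
    return total

# Toy signed network (assumption)
toy_edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", -1), ("c", "d", 1)]
two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
print(imbalance(toy_edges, two_clusters), imbalance(toy_edges, one_cluster))
```

    In this toy network both partitions have the same imbalance, which shows how several distinct optimal solutions can coexist.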

  11. NetVotes 2017 - iKnow’17

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 31, 2023
    Cite
    Nejat Arinik; Vincent Labatut (2023). NetVotes 2017 - iKnow’17 [Dataset]. http://doi.org/10.6084/m9.figshare.5785833.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nejat Arinik; Vincent Labatut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data used in the experiment of the paper submitted to the following conference:
    N. Arinik, R. Figueiredo, V. Labatut, Signed graph analysis for the interpretation of voting behavior, in: International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities, Graz, AT, 2017. URL http://ceur-ws.org/Vol-2025/paper_rssna_1.pdf
    The source code is accessible here: https://github.com/CompNet/NetVotes

    # RAW INPUT FILES
    The 'itsyourparliament' folder contains all raw input files for further data processing (such as network extraction). The folder structure is as follows:
    * itsyourparliament/
    ** domains: There are 28 domain files. Each file corresponds to a domain (such as Agriculture, Economy, etc.) and contains corresponding vote identifiers and their "itsyourparliament.eu" links.
    ** meps: There are 870 Member of the European Parliament (MEP) files. Each file contains the MEP information (such as name, country, address, etc.).
    ** votes: There are 7513 vote files. Each file contains the votes expressed by MEPs.

    # NETWORKS AND CORRESPONDING PARTITIONS
    This work studies the voting behavior of French and Italian MEPs on "Agriculture and Rural Development" (AGRI) and "Economic and Monetary Affairs" (ECON) for each separate year of the 7th EP term (2009-10, 2010-11, 2011-12, 2012-13, 2013-14). Note that the interpretation part (section 4) of the published paper is limited to only a few of these instances (2009-10 in ECON and 2012-13 in AGRI).
    The extracted networks are located in the "networks" folder and the corresponding partitions are in the "partitions" folder. Both folders have the same structure, as follows:
    COUNTRY-NAME
    |_DOMAIN-NAME
      |_2009-10
      |_2010-11
      |_2011-12
      |_2012-13
      |_2013-14

    ## NETWORKS
    The networks in this folder are used in the article. All of them are the ones obtained after the filtering step (as explained in the article). The networks are in 'Graphml' format and are enriched with some MEPs' properties (such as name, political party, etc.) associated with each node.

    ## ALL NETWORKS
    For those who are interested in other countries or domains, we make available all possible networks that we can extract from the raw data, with and without the filtering step.
    COUNTRY-NAME
    |_m3
      |_negtr=NA_postr=NA: This folder contains all filtered networks. Note that the filtering step is explained in Section 2.1.2 of the article.
        |_bygroup
        |_bycountry
      |_negtr=0_postr=0: This folder contains all original networks (i.e. no filtering step).
        |_bygroup
        |_bycountry

    ## PARTITIONS
    The partitions are obtained in this way: first, the Ex-CC (exact) method is run, and we denote by 'k' the number of clusters detected in its output. This 'k' value is the reference point for running the ILS-RCC (heuristic) method, which requires specifying the desired number of clusters in the output. Then, ILS-RCC is run with various values ('k', 'k+1', 'k+2'). All those results are integrated into the initial network graphml files and then converted into Gephi format, which helps to explore the results interactively.
    Note that we need to handle the absent MEPs in the clustering results, because those MEPs correspond to isolated nodes in the networks. Each isolated node is considered a single-node cluster in the Ex-CC results. We simply omit those nodes in order to find the 'k' (number of detected clusters) value before running ILS-RCC. Note also that ILS-RCC does not handle isolated nodes separately, so an isolated node can be part of a cluster.

    # COMPARISON RESULTS
    The 'material-stats' folder contains all the comparison results obtained for Ex-CC and ILS-CC. The csv files associated with the plots are also provided. The folder structure is as follows:
    * material-stats/
    ** execTimePerf: The plot shows the execution time of Ex-CC and ILS-CC based on randomly generated complete networks of different sizes.
    ** graphStructureAnalysis: The plots show the weights and links statistics for all instances.
    ** ILS-CC-vs-Ex-CC: The folder contains 4 different comparisons between Ex-CC and ILS-CC: imbalance difference, number of detected clusters, difference of the number of detected clusters, and NMI (Normalized Mutual Information).
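
    The way 'k' is derived in the PARTITIONS description can be sketched as follows: clusters made up only of isolated nodes (absent MEPs) are ignored when counting, and the heuristic is then run for k, k+1 and k+2. This is an illustrative sketch under assumptions: the function names and the membership dictionary are hypothetical, and neither Ex-CC nor ILS-RCC is implemented here.

```python
def count_clusters_without_isolated(membership, isolated_nodes):
    """Count clusters after omitting isolated nodes (absent MEPs)."""
    kept = {
        cluster
        for node, cluster in membership.items()
        if node not in isolated_nodes
    }
    return len(kept)

def candidate_cluster_counts(k):
    """Cluster counts to request from the heuristic: k, k+1, k+2."""
    return [k, k + 1, k + 2]

# Toy exact-method output (assumption): two real clusters plus
# two isolated nodes that each form their own single-node cluster
toy_membership = {"mep1": 0, "mep2": 0, "mep3": 1, "mep4": 2, "mep5": 3}
toy_isolated = {"mep4", "mep5"}

k = count_clusters_without_isolated(toy_membership, toy_isolated)
candidates = candidate_cluster_counts(k)
print(k, candidates)
```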

  12. Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jenna Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.

    Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

    1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with information as follows.
    - 1st column: instance id (numeric): unique id assigned to a name instance
    - 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
    - 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
    - 4th column: author name (string): name string formatted as surname, comma, and forename(s)
    - 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
    - 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
    - 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
    - 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

    2. 'Records' files contain lists of papers. Each paper is associated with information as follows.
    - 1st column: paper id (numeric): unique paper id; this is the unique paper id (2nd column) in Signatures files
    - 2nd column: year (numeric): year of publication
      * Some papers may have wrong publication years due to incorrect indexing or delayed updates in original data
    - 3rd column: venue (string): name of journal or conference in which the paper is published
      * Venue names can be in full string or in a shortened format according to the formats in original data
    - 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline
      * Author names are formatted into surname, comma, and forename(s)
    - 5th column: title words (string; separated by space): words in a title of the paper
      * Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

    3. 'Clusters' files contain lists of clusters. Each cluster is associated with information as follows.
    - 1st column: cluster id (numeric): unique id of a cluster
    - 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column)

    Signatures and Clusters files consist of two subsets - train and test files - of the original labeled data, which were randomly split 50%-50% by the authors of this study.

    Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

    [AMiner.zip]
    Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
    Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

    [KISTI.zip]
    Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
    Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
    Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

    [GESIS.zip]
    Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
    Note that this study reuses the 'Evaluation Set' from the original GESIS data, to which titles were added by the study below.
    Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

    [UM-IRIS.zip]
    This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
    Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
    For details on the labeling method and limitations, see the paper below.
    Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
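The column layouts described above could be read along the following lines. This is only a sketch under stated assumptions: the description does not say which delimiter separates columns, so a tab is assumed here, and the names `Signature`, `Record`, `parse_signature`, `parse_record`, and `group_by_author` are hypothetical helpers, not part of the dataset. The vertical-bar separator for authors and the space separator for title words are taken from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

SEP = "\t"  # ASSUMPTION: the column delimiter is not stated in the description


@dataclass
class Signature:
    """One row of a Signatures file (8 columns, all kept as strings)."""
    instance_id: str
    paper_id: str
    byline_position: str
    author_name: str
    ethnic_name_group: str
    affiliation: str
    block: str
    author_id: str


@dataclass
class Record:
    """One row of a Records file (5 columns)."""
    paper_id: str
    year: str
    venue: str
    authors: List[str]
    title_words: List[str]


def parse_signature(line: str, sep: str = SEP) -> Signature:
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return Signature(*fields)


def parse_record(line: str, sep: str = SEP) -> Record:
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    paper_id, year, venue, authors, title = fields
    # Authors are separated by a vertical bar, title words by a space.
    return Record(paper_id, year, venue, authors.split("|"), title.split())


def group_by_author(signatures: Iterable[Signature]) -> Dict[str, List[str]]:
    """Group instance ids by author id, mirroring what a Clusters row lists."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for sig in signatures:
        clusters[sig.author_id].append(sig.instance_id)
    return dict(clusters)
```

Values are kept as strings on purpose, since the description warns that some fields (for example, publication years) may be unreliable in the original data; convert them only after checking the actual files.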

  13. Space of optimal solutions of the Correlation Clustering problem on Complete...

    • figshare.com
    zip
    Updated Sep 24, 2020
    Nejat Arinik; Vincent Labatut (2020). Space of optimal solutions of the Correlation Clustering problem on Complete Signed Graphs [Dataset]. http://doi.org/10.6084/m9.figshare.8233340.v5
    Available download formats: zip
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nejat Arinik; Vincent Labatut
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data used in the experiments of our paper:
    N. Arinik, R. Figueiredo, V. Labatut (2020), Multiplicity and Diversity: Analyzing the Optimal Solution Space of the Correlation Clustering Problem on Complete Signed Graphs, Journal of Complex Networks, DOI: 10.1093/comnet/cnaa025.
    The source code is accessible here: https://github.com/CompNet/Sosocc

    This dataset contains:
    * Plot files used in the article
    * Input signed networks
    * All optimal solutions (i.e. optimal solution space) of the corresponding networks
    * Evaluation files

    # PLOT FILES
    * Figure1.zip: Figures showing that there might be many distinct optimal solutions of a small-sized network.
    * Figure2.zip: Figures showing that distinct optimal solutions of a given network might be partition-wise very similar or different.
    * Figure4: All Results.zip: For space reasons, Figure 4 in the article contains only a few of the result plots. This zip file contains all plots, and it is organized by the values of l0. In each l0 folder, the results are shown from three different perspectives:
      --- Detected Imbalance Percentage vs Graph Order (i.e. number of vertices)
      --- Prop mispl vs Graph order
      --- Graph order vs Prop mispl
    * workflow.pdf: The workflow of the methodology used in the article.
    * Syrian network With All Solutions.pdf: Syrian network (on top) with core part information through node colors, and its optimal solutions in which node colors represent partition information (on bottom).

    # NETWORKS
    All networks are in Input Signed Networks.tar.gz.
    Networks are generated through a simple random model (available at https://github.com/CompNet/SignedBenchmark) designed to produce complete (or incomplete) unweighted networks with built-in modular structure. There are 3 parameters used for the generation:
    - number of nodes (n)
    - initial number of modules (l0)
    - proportion of misplaced links, i.e. proportion of frustrated links (qm)

    Inside Input Signed Networks.tar.gz:
    NETWORKS
    |_n=NB-NODE_l0=INIT_NB_MODULE_dens=1.0000
    ....|_propMispl=PROP_MISPL
    ........|_propNeg=PROP_NEG
    ............|_network=NETWORK_NO

    - The first hierarchy => the folders are named as follows: n=NB-NODE_l0=INIT-NB-MODULE_dens=1.0000
      The number of nodes, the initial number of modules and the network density are given. The network density is always 1, since we treat only complete signed networks.
    - The second hierarchy => the folders are named as follows: propMispl=PROP_MISPL
      The proportion of misplaced links is given.
    - The third hierarchy => the folders are named as follows: propNeg=PROP_NEG
      The proportion of negative links (qn) is specified. qn changes depending on n and l0. Since only complete signed networks are studied, this parameter is automatically computed from the other input parameters.
    - The fourth hierarchy => the folders are named as follows: network=NETWORK_NO
      Network numbers are shown.

    In the end, there are three file formats describing the same network content: GraphML (.graphml), Pajek NET (.net) or .G format.

    # PARTITIONS
    All partition results are in Partition Results.tar.gz. Note that all optimal partitions of a signed network are obtained through an exact partitioning method. The source code is accessible here: https://github.com/arinik9/ExCC

    Inside Partition Results.tar.gz:
    PARTITIONS
    |_n=NB-NODE_l0=INIT_NB_MODULE_dens=1.0000
    ....|_propMispl=PROP_MISPL
    ........|_propNeg=PROP_NEG
    ............|_network=NETWORK_NO
    ................|_"ExCC-all"
    ....................|_"signed-unweighted"

    - The first hierarchy => the folders are named as follows: n=NB-NODE_l0=INIT-NB-MODULE_dens=1.0000
    - The second hierarchy => the folders are named as follows: propMispl=PROP_MISPL
    - The third hierarchy => the folders are named as follows: propNeg=PROP_NEG
    - The fourth hierarchy => the folders are named as follows: network=NETWORK_NO
    - The fifth hierarchy => the folders are named as follows: "ExCC-all"
      The name of the partitioning method is shown. Since an exact partitioning method is used to obtain all distinct optimal solutions, it is named "ExCC-all".
    - The sixth hierarchy => the folders are named as follows: "signed-unweighted"
      The type of signed network is shown: signed and unweighted.

    In the end, the partition results are located, and the files are named as follows: membership.txt. Note that the first partition result number starts from zero.

    # EVALUATIONS
    Evaluation results related to our plots are in Evaluation Results.tar.gz. Note that the hierarchy of this folder is the same as that of 'Partitions'.

    Inside Evaluation Results.tar.gz:
    - Best-k-for-kmedoids.csv: It contains three columns: 1) the number of solution classes via kmedoids, 2) the best Silhouette score, 3) the best clustering in terms of Silhouette score, which represents solution classes.
    - class-core-part-size-tresh=1.00.csv: It indicates the proportion of core part size for each solution class.
    - exec-time.csv: It indicates the execution time in seconds.
    - imbalance.csv: It contains the information of imbalance as 1) count and 2) percentage.
    - nb-solution.csv: It indicates the total number of solutions.
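The folder naming scheme described above encodes the generation parameters in the path, so they can be recovered by parsing path components. The sketch below is illustrative only: `parse_network_path` is a hypothetical helper, and it assumes that the placeholders (NB-NODE, INIT_NB_MODULE, PROP_MISPL, PROP_NEG, NETWORK_NO) are filled with plain integers or decimals, which the description does not state explicitly.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Union

# First-level folder, e.g. "n=20_l0=3_dens=1.0000" (example values are made up).
_TOP_LEVEL = re.compile(r"^n=(\d+)_l0=(\d+)_dens=([0-9.]+)$")

Number = Union[int, float]


def parse_network_path(path: str) -> Dict[str, Number]:
    """Extract n, l0, dens, propMispl, propNeg and network from a folder path.

    ASSUMPTION: parameter values are written as plain integers/decimals.
    Components without a recognised "key=value" form are ignored.
    """
    params: Dict[str, Number] = {}
    for part in PurePosixPath(path).parts:
        top = _TOP_LEVEL.match(part)
        if top:
            params["n"] = int(top.group(1))
            params["l0"] = int(top.group(2))
            params["dens"] = float(top.group(3))
        elif "=" in part:
            key, value = part.split("=", 1)
            if key in ("propMispl", "propNeg"):
                params[key] = float(value)
            elif key == "network":
                params[key] = int(value)
    return params
```

Components such as NETWORKS, "ExCC-all", or "signed-unweighted" carry no parameter and are skipped, so the same helper would apply to paths inside the partition and evaluation archives, given that they share the hierarchy.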

  14. Number of PT/INR results analyzed in different age groups.

    • plos.figshare.com
    xls
    Updated Jun 13, 2023
    Muhammad Shariq Shaikh; Sibtain Ahmed (2023). Number of PT/INR results anlayzed in different age groups. [Dataset]. http://doi.org/10.1371/journal.pone.0276884.t001
    Available download formats: xls
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Muhammad Shariq Shaikh; Sibtain Ahmed
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of PT/INR results analyzed in different age groups.

Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Summary of model performance (median (Q1-Q3)) [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t003

Summary of model performance (median (Q1-Q3))

2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: xls
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Summary of model performance (median (Q1-Q3))
