14 datasets found

Summary of model performance (median (Q1-Q3))
figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 1, 2023
Cite
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Summary of model performance (median (Q1-Q3)) [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0177811.t003
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary of model performance (median (Q1-Q3))
Figure 6 and 7 from manuscript Sparsely-Connected Autoencoder (SCA) for...
figshare.com
zip
Updated Aug 26, 2020
Cite
Raffaele Calogero (2020). Figure 6 and 7 from manuscript Sparsely-Connected Autoencoder (SCA) for single cell RNAseq data mining [Dataset]. http://doi.org/10.6084/m9.figshare.12866897.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.12866897.v1
Dataset updated
Aug 26, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Raffaele Calogero
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset used to generate figures 6 and 7.

Figure 6: Analysis of human breast cancer (Block A Section 1), from 10XGenomics Visium Spatial Gene Expression 1.0.0 demonstration samples. A) SIMLR partitioning in 9 clusters (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). B) Cell stability score plot for SIMLR clusters in A (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). C) SIMLR cluster locations in the tissue section (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_spatial_Stability.pdf). D) Hematoxylin and eosin image (figure6and7/HBC_BAS1/spatial/V1_Breast_Cancer_Block_A_Section_1_image.tif).

Figure 7: Information content extracted by SCA analysis using a TF-based latent space. A) QCC (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlot.pdf). B) QCM (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlotUNBIAS.pdf). C) QCM/QCC plot, where only cluster 7 shows, for the majority of the cells, both QCC and QCM greater than 0.5 (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_StabilitySignificativityJittered.pdf). D) COMET analysis of the SCA latent space. SOX5 was detected as the top-ranked gene specific for cluster 7, using the latent space frequency table as input for COMET (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/outputvis/cluster_7_singleton/rank_1.png). The input counts table for SCA analysis consists of raw counts.
Data from: Graph Regionalization with Clustering and Partitioning: an...
dataverse.harvard.edu
search.dataone.org
Updated Sep 23, 2015
Cite
BENASSI FEDERICO (2015). Graph Regionalization with Clustering and Partitioning: an Application for Daily Commuting Flows in Albania [Dataset]. http://doi.org/10.7910/DVN/3AVOGY
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/3AVOGY
Dataset updated
Sep 23, 2015
Dataset provided by
Harvard Dataverse
Authors
BENASSI FEDERICO
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Albania
Description
The paper presents an original application of the recently proposed spatial data mining method named GraphRECAP on daily commuting flows using 2011 Albanian census data. Its aim is to identify several clusters of Albanian municipalities/communes and to propose a classification of the Albanian territory based on daily commuting flows among municipalities/communes. Starting from 373 local units, we first applied a spatial clustering technique without imposing any constraining strategy. Based on the input variables, we obtained 16 clusters. In the second step of our analysis, we imposed a set of constraining parameters to identify intermediate areas between the local level (municipality/commune) and the national one. We defined 12 derived regions (the same number as the actual Albanian prefectures but with different geographies). These derived regions are quite different from the traditional ones in terms of both geographical dimensions and boundaries.
Malaria disease and grading system dataset from public hospitals reflecting...
data.niaid.nih.gov
datadryad.org
zip
Updated Nov 10, 2023
Cite
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.4xgxd25gn
Dataset updated
Nov 10, 2023
Dataset provided by
Nasarawa State University
Authors
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
License
https://spdx.org/licenses/CC0-1.0.html
Description
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets from public hospitals, but there are still limitations in modeling using the multinomial Naïve Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The MNB model has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

Methods

Prior to data collection, the researcher was guided by the ethical training certification on data collection and the right to confidentiality and privacy, under the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result from diagnosis. The data were divided into two tables: the first table, data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.

Data Source Collection

The malaria incidence data set was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis, bearing in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naïve Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.

Data Preprocessing: Data preprocessing shall be done to remove noise and outliers.

Transformation: The data shall be transformed from analog to electronic records.

Data Partitioning: The collected data will be divided into two portions; one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2. The dataset was split into two parts: a training sample containing 70% of the data and the remaining 30% for testing. Then, using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using standard metrics.

Classification and prediction: Based on the nature of the variables in the dataset, this study will use Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as the test data set.
iv. Part of the test data set is classified using classifier 1 and the remaining part is classified with classifier 2, as follows:
Classifier phase 1: It classifies records into positive or negative classes. A patient who has malaria is classified as positive (P), while a patient who does not have malaria is classified as negative (N).
Classifier phase 2: It classifies only the records that were classified as positive by classifier 1, further classifying them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, cultural and socio-economic variables. The system will be designed such that the core parameters, as determining factors, supply their values.
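The 70%/30% split and the two-phase cascade described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two stub functions stand in for the trained MNB classifiers, and the field names `parasite_count` and `severe_signs` are hypothetical, not attributes of the actual dataset.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it 70% train / 30% test."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def classify_two_phase(record, phase1, phase2):
    """Phase 1 decides P/N; only positives are passed on to phase 2."""
    if phase1(record) != "P":
        return "N"
    return phase2(record)  # "complicated" or "uncomplicated"


# Hypothetical stand-ins for the two trained classifiers (assumption).
def phase1_stub(record):
    return "P" if record["parasite_count"] > 0 else "N"


def phase2_stub(record):
    return "complicated" if record["severe_signs"] else "uncomplicated"


# Made-up records, for illustration only.
records = [{"parasite_count": i % 3, "severe_signs": i % 4 == 0} for i in range(20)]
train, test = split_70_30(records)
labels = [classify_two_phase(r, phase1_stub, phase2_stub) for r in test]
print(len(train), len(test), labels)
```

The cascade keeps phase 2 from ever seeing a record that phase 1 labelled negative, which matches the description of classifier phase 2.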
Performance comparison of the three classifiers on the CHD data.
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Performance comparison of the three classifiers on the CHD data. [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0177811.t002
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of the three classifiers on the CHD data.
Data from: Red Wine Quality
kaggle.com
zip
Updated Nov 27, 2017
+ more versions
Cite
UCI Machine Learning (2017). Red Wine Quality [Dataset]. https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
Explore at:
Available download formats: zip (26176 bytes)
Dataset updated
Nov 27, 2017
Dataset authored and provided by
UCI Machine Learning
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down if requested.)

Content

For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Tips

An interesting exercise, aside from regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. classifying 7 or higher as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting you should be able to get an AUC of .88 (without even using a random forest algorithm).

KNIME is a great tool (GUI) that can be used for this.
1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram for basic EDA.
2 - File Reader to Rule Engine node to turn the 10-point scale into a dichotomous variable (good wine and rest); the code to put in the Rule Engine is something like this:
- $quality$ > 6.5 => "good"
- TRUE => "bad"
3 - Rule Engine node output to input of Column Filter node to filter out your original 10-point feature (this prevents leakage).
4 - Column Filter node output to input of Partitioning node (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified').
5 - Partitioning node train data split output to input of Decision Tree Learner node.
6 - Partitioning node test data split output to input of Decision Tree Predictor node.
7 - Decision Tree Learner node output to input of Decision Tree Predictor node.
8 - Decision Tree Predictor output to input of ROC node (here you can evaluate your model based on the AUC value).
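For readers who prefer code to a GUI, the labelling, leakage-filter, split, and AUC steps of the workflow above could look roughly like this in plain Python. It is a sketch, not the KNIME implementation: the decision tree learner itself is left out, the rows are made up, and the helper names (`rule_engine`, `drop_column`, `stratified_split`, `auc`) are hypothetical.

```python
import random


def rule_engine(quality):
    """Same rule as in step 2: quality > 6.5 => "good", otherwise "bad"."""
    return "good" if quality > 6.5 else "bad"


def drop_column(row, name):
    """Column filter: remove the original score so it cannot leak into the model."""
    return {k: v for k, v in row.items() if k != name}


def stratified_split(rows, label_key, train_frac=0.75, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(train_frac * len(group)))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def auc(scores, labels, positive="good"):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up rows for illustration; the real file has 11 input columns plus quality.
rows = [{"alcohol": 9.0 + 0.25 * i, "quality": 3 + i % 6} for i in range(16)]
labelled = [dict(drop_column(r, "quality"), label=rule_engine(r["quality"])) for r in rows]
train, test = stratified_split(labelled, "label")
print(len(train), len(test))
print(auc([0.9, 0.8, 0.3, 0.1], ["good", "bad", "good", "bad"]))
```

Any classifier that outputs a score per test row can be plugged into `auc`; the pairwise definition used here is equivalent to the area under the ROC curve.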

Inspiration

Use machine learning to determine which physicochemical properties make a wine 'good'!

Acknowledgements

This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down at first request.) I am not the owner of this dataset.

Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Relevant publication

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
The input data set includes 729 objects (patients) and 39 variables...
zenodo.org
csv, png +1
Updated Jul 16, 2024
Cite
Miroslava Nedyalkova; Miroslava Nedyalkova (2024). The input data set includes 729 objects (patients) and 39 variables (clinical qualitative and quantitative descriptors). [Dataset]. http://doi.org/10.5281/zenodo.6652207
Explore at:
Available download formats: png, csv, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.6652207
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Miroslava Nedyalkova; Miroslava Nedyalkova
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For reliable data treatment and interpretation qualitative descriptors were omitted and only numerical clinical indicators were included in the data matrix. Finally, the data set dimension was [729 x 18].

The data were treated by hierarchical cluster analysis and factor analysis. The major goal of the data mining was to reach statistically significant partitioning of the objects and variables into similarity patterns (clusters) which helps to better understand the data structure, to assess the meaning of the partitioning achieved, thus promoting the evaluation of the health status of the patients and the role of specific descriptors for the formation of the partitioning patterns.

3D classification Python tool.
Description of nine indicator variables.
plos.figshare.com
xls
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Description of nine indicator variables. [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0177811.t001
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description of nine indicator variables.
Soil and Landscape Grid Digital Soil Property Maps for Tasmania (3"...
researchdata.edu.au
data.csiro.au
datadownload
Updated Nov 24, 2022
Cite
Alex McBratney; Budiman Minasny; Brendan Malone; Mathew Webb; Darren Kidd (2022). Soil and Landscape Grid Digital Soil Property Maps for Tasmania (3" resolution) [Dataset]. http://doi.org/10.4225/08/5AAF364C54CC8
Explore at:
Available download formats: datadownload
Unique identifier
https://doi.org/10.4225/08/5AAF364C54CC8
Dataset updated
Nov 24, 2022
Dataset provided by
CSIRO (http://www.csiro.au/)
Authors
Alex McBratney; Budiman Minasny; Brendan Malone; Mathew Webb; Darren Kidd
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1947 - Sep 30, 2014
Area covered

Description
These are the soil attribute products of the Tasmanian Soil Attribute Grids. There are 8 soil attribute products available from the TERN Soil Facility. Each soil attribute product is a collection of 6 depth slices. Each depth raster has an upper and lower uncertainty limit raster associated with it. The depths provided are 0-5cm, 5-15cm, 15-30cm, 30-60cm, 60-100cm & 100-200cm, consistent with the Specifications of the GlobalSoilMap.

Attributes: pH - Water (pHw); Electrical Conductivity dS/m (ECD); Clay % (CLY); Sand % (SND); Silt % (SLT); Bulk Density - Whole Earth Mg/m3 (BDw); Organic Carbon % (SOC); Coarse Fragments >2mm (CFG).

These products were developed using datasets held by the Tasmanian Department of Primary Industries Parks Water & Environment (DPIPWE) Soils Database. The mapping was made by using spatial modelling and digital soil mapping (DSM) techniques to produce a fine resolution 3 arc-second grid of soil attribute values and their uncertainties, across all of Tasmania.

Note: Previous versions of this collection contained a Depth layer. This has been removed as the units do not comply with Global Soil Map specifications.

Lineage: The soil attribute maps are generated using spatial modelling and digital soil mapping techniques.

Soil inventory:

Tasmanian soil site data originates from the DPIPWE soils database, a compilation of various historical soil surveys undertaken by DPIPWE, CSIRO, Forestry Tasmania and the University of Tasmania. This database contains morphological and laboratory data for all the soil sites.

Data Modelling:

A raster stack of all covariates was generated and the target variable (each soil property and depth) individually intersected with the covariate values to provide the calibration and validation data. All modelling was undertaken in ‘R’ (R Development Core Team 2012), using Regression tree (RT), specifically the Cubist R package (Kuhn, Weston et al. 2012; Kuhn, Weston et al. 2013; Quinlan 2005). The RT approach is a popular modelling approach for many disciplines (Breiman, Friedman et al. 1984), and has been widely used with DSM (Grunwald 2009; Kidd, Malone et al. 2014; McKenzie and Ryan 1999). Cubist develops the regression trees by first applying a data mining-approach to partition the calibration and explanatory covariate values into a set of structured ‘classifier’ data. The tree structure is developed by repeatedly partitioning the data into linear models until no significant measure of difference in the calibration data is determined (McBratney, Mendonça Santos et al. 2003). A series of covariate-based rules (conditions) is developed, and the linear model corresponding to the covariate conditions is applied to produce the final modelled surface. For this modelling exercise, the number of rules was set within the model controls to let the Cubist algorithm decide upon the optimum number of rules to generate.

Uncertainty:

Leave-one-out cross-validation (LOOCV) was applied to the Cubist model to generate rule-based uncertainties, using only those covariates forming the conditional partitioning of that rule, following Malone et al (2014). The LOOCV, applied to an individual Cubist model for each rule, effectively produced a mean value for each RT partition, with the 5% and 95% quantiles of the prediction variation providing the lower and upper prediction uncertainty values respectively, at the 90% Prediction Interval (PI). A 10-fold cross validation was used to run this process 10 times across all data to produce mean modelling diagnostics and validations, and reduce modelling bias due to sensitivity to training data variance.
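The idea of deriving a 90% prediction interval from leave-one-out errors within one rule partition can be illustrated with a minimal sketch. It makes a strong simplifying assumption: the per-rule Cubist linear model is replaced by a plain mean predictor, and the function names and sample values are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def loocv_interval(values, lower_q=0.05, upper_q=0.95):
    """Leave each observation out, predict it from the mean of the rest,
    and return the lower/upper quantiles of the resulting errors."""
    n = len(values)
    total = sum(values)
    errors = sorted(v - (total - v) / (n - 1) for v in values)
    return quantile(errors, lower_q), quantile(errors, upper_q)


# Made-up observations falling into a single rule partition (assumption).
partition_values = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = loocv_interval(partition_values)
prediction = sum(partition_values) / len(partition_values)
print(prediction + lower, prediction, prediction + upper)
```

Adding the two error quantiles to a prediction for that partition gives lower and upper limits analogous to the uncertainty rasters described above.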
Learning optimal solution characteristics of the Correlation Clustering...
figshare.com
application/gzip
Updated Mar 14, 2022
Cite
Nejat Arinik; Vincent Labatut (2022). Learning optimal solution characteristics of the Correlation Clustering problem [Dataset]. http://doi.org/10.6084/m9.figshare.19350284.v2
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.19350284.v2
Dataset updated
Mar 14, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nejat Arinik; Vincent Labatut
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is used for studying the space of optimal solutions of the Correlation Clustering problem. It contains both complete and incomplete signed networks, as well as their spaces of optimal solutions.
NetVotes 2017 - iKnow’17
figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nejat Arinik; Vincent Labatut (2023). NetVotes 2017 - iKnow’17 [Dataset]. http://doi.org/10.6084/m9.figshare.5785833.v3
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.5785833.v3
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nejat Arinik; Vincent Labatut
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the data used in the experiment of the paper submitted to the following conference:

N. Arinik, R. Figueiredo, V. Labatut, Signed graph analysis for the interpretation of voting behavior, in: International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities, Graz, AT, 2017. URL http://ceur-ws.org/Vol-2025/paper_rssna_1.pdf

The source code is accessible here: https://github.com/CompNet/NetVotes

# RAW INPUT FILES

The 'itsyourparliament' folder contains all raw input files for further data processing (such as network extraction). The folder structure is as follows:
* itsyourparliament/
** domains: There are 28 domain files. Each file corresponds to a domain (such as Agriculture, Economy, etc.) and contains corresponding vote identifiers and their "itsyourparliament.eu" links.
** meps: There are 870 Member of Parliament (MEP) files. Each file contains the MEP information (such as name, country, address, etc.)
** votes: There are 7513 vote files. Each file contains the votes expressed by MEPs

# NETWORKS AND CORRESPONDING PARTITIONS

This work studies the voting behavior of French and Italian MEPs on "Agriculture and Rural Development" (AGRI) and "Economic and Monetary Affairs" (ECON) for each separate year of the 7th EP term (2009-10, 2010-11, 2011-12, 2012-13, 2013-14). Note that the interpretation part (section 4) of the published paper is limited to only a few of these instances (2009-10 in ECON and 2012-13 in AGRI).

The extracted networks are located in the "networks" folder and the corresponding partitions are in the "partitions" folder. Both folders have the same folder structure, which is as follows:
COUNTRY-NAME
|_DOMAIN-NAME
  |_2009-10
  |_2010-11
  |_2011-12
  |_2012-13
  |_2013-14

## NETWORKS

The networks in this folder are used in the article. All those networks are the ones obtained after the filtering step (as explained in the article). The networks are in 'Graphml' format. These networks are enriched with some MEPs' properties (such as name, political party, etc.) associated with each node.

## ALL NETWORKS

For those who are interested in other countries or domains, we make available all possible networks that we can extract from raw data, with vs. without the filtering step.
COUNTRY-NAME
|_m3
  |_negtr=NA_postr=NA: This folder contains all filtered networks. Note that the filtering step is explained in Section 2.1.2 of the article.
    |_bygroup
    |_bycountry
  |_negtr=0_postr=0: This folder contains all original networks (i.e. no filtering step).
    |_bygroup
    |_bycountry

## PARTITIONS

The partitions are obtained in this way: First, the Ex-CC (exact) method is run and we denote by 'k' the number of clusters detected in its output. This 'k' value is the reference point for running the ILS-RCC (heuristic) method, which requires specifying the desired number of clusters in its output. Then, ILS-RCC is run with various values ('k', 'k+1', 'k+2'). All those results are integrated into the initial network graphml files and then converted into gephi format, which helps to explore the results interactively.

Note that we need to handle the absent MEPs in clustering results, because those MEPs correspond to isolated nodes in networks. Each isolated node is considered a single-node cluster in Ex-CC results. We simply omit those nodes in order to find the 'k' (number of detected clusters) value before running ILS-RCC. Note also that ILS-RCC does not process isolated nodes, so an isolated node can be part of a cluster.

# COMPARISON RESULTS

The 'material-stats' folder contains all the comparison results obtained for Ex-CC and ILS-CC. The csv files associated with plots are also provided. The folder structure is as follows:
* material-stats/
** execTimePerf: The plot shows the execution time of Ex-CC and ILS-CC based on randomly generated complete networks of different size.
** graphStructureAnalysis: The plots show the weights and links statistics for all instances.
** ILS-CC-vs-Ex-CC: The folder contains 4 different comparisons between Ex-CC and ILS-CC: Imbalance difference, number of detected clusters, difference of the number of detected clusters, NMI (Normalized Mutual Information)
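The way 'k' is derived from an Ex-CC result, as described under PARTITIONS, can be sketched as below. This is a rough illustration rather than the NetVotes code: the function names are hypothetical, the network is reduced to an unweighted edge list, and the node names and cluster ids are made up.

```python
def count_clusters_without_isolated(membership, edges):
    """Count clusters after dropping isolated nodes.

    membership: dict mapping node -> cluster id (e.g. an Ex-CC result)
    edges: iterable of (u, v) pairs; a node with no edge is isolated
    """
    connected = set()
    for u, v in edges:
        connected.add(u)
        connected.add(v)
    return len({cluster for node, cluster in membership.items() if node in connected})


def candidate_ks(k):
    """Cluster counts then passed to the heuristic: k, k+1, k+2."""
    return [k, k + 1, k + 2]


# Made-up example: "d" and "e" are absent MEPs, i.e. isolated nodes that the
# exact method placed in their own singleton clusters.
membership = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 4}
edges = [("a", "b"), ("b", "c")]
k = count_clusters_without_isolated(membership, edges)
print(k, candidate_ks(k))
```

Counting clusters only over connected nodes keeps the singleton clusters of absent MEPs from inflating 'k'.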
Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation...
figshare.com
zip
Updated May 30, 2023
Cite
Jinseok Kim; Jenna Kim; Jason Owen-Smith (2023). Dataset: Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.14043791.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.14043791.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jinseok Kim; Jenna Kim; Jason Owen-Smith
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.

Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.

1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with information as follows.
- 1st column: instance id (numeric): unique id assigned to a name instance
- 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
- 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
- 4th column: author name (string): name string formatted as surname, comma, and forename(s)
- 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
- 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
- 7th column: block (string): simplified name string of the name instance to indicate its block membership (surname and first forename initial)
- 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data

2. 'Records' files contain lists of papers. Each paper is associated with information as follows.
- 1st column: paper id (numeric): unique paper id; this is the same paper id as the 2nd column in Signatures files
- 2nd column: year (numeric): year of publication
  * Some papers may have wrong publication years due to incorrect indexing or delayed updates in the original data
- 3rd column: venue (string): name of the journal or conference in which the paper is published
  * Venue names can be full strings or shortened, following the formats in the original data
- 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline
  * Author names are formatted as surname, comma, and forename(s)
- 5th column: title words (string; separated by space): words in the title of the paper
  * Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.

3. 'Clusters' files contain lists of clusters. Each cluster is associated with information as follows.
- 1st column: cluster id (numeric): unique id of a cluster
- 2nd column: list of name instance ids (Signatures - 1st column) that belong to the same unique author id (Signatures - 8th column)

Signatures and Clusters files consist of two subsets, train and test files, of the original labeled data, which were randomly split 50%-50% by the authors of this study.

Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.

[AMiner.zip]
Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.

[KISTI.zip]
Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5

[GESIS.zip]
Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
Note that this study reuses the 'Evaluation Set' among the original GESIS data, to which titles were added by the study below.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298

[UM-IRIS.zip]
This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
Kim, J., Kim, J., & Owen-Smith, J. (In print). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
For details on the labeling method and limitations, see the paper below.
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
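As a rough illustration of the eight-column Signatures layout described above, the following Python sketch reads one row into a typed record. The description does not state the field delimiter or whether a header row exists, so the tab separator is an assumption, and the names Signature and parse_signature as well as the sample row are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signature:
    """One name instance (one row) of a Signatures file."""
    instance_id: int            # 1st column
    paper_id: int               # 2nd column
    byline_position: int        # 3rd column
    author_name: str            # 4th column: "surname, forename(s)"
    ethnic_group: str           # 5th column: group assigned by Ethnea
    affiliation: Optional[str]  # 6th column: may be missing
    block: str                  # 7th column: surname + first forename initial
    author_id: str              # 8th column: author label


def parse_signature(line: str, sep: str = "\t") -> Signature:
    # ASSUMPTION: fields are tab-separated; pass another `sep` if they are not.
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return Signature(
        instance_id=int(fields[0]),
        paper_id=int(fields[1]),
        byline_position=int(fields[2]),
        author_name=fields[3],
        ethnic_group=fields[4],
        affiliation=fields[5] or None,  # empty string means no affiliation
        block=fields[6],
        author_id=fields[7],
    )


# Made-up row for illustration only; it is not taken from the dataset.
sample = "1\t100\t2\tDoe, Jane\tENGLISH\t\tdoe j\tA1\n"
print(parse_signature(sample))
```

Checking the column count before converting makes a wrong delimiter guess fail loudly instead of silently shifting fields.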
Space of optimal solutions of the Correlation Clustering problem on Complete...
figshare.com
zip
Updated Sep 24, 2020
Nejat Arinik; Vincent Labatut (2020). Space of optimal solutions of the Correlation Clustering problem on Complete Signed Graphs [Dataset]. http://doi.org/10.6084/m9.figshare.8233340.v5
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.8233340.v5
Dataset updated
Sep 24, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nejat Arinik; Vincent Labatut
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the data used in the experiments of our paper:
N. Arinik, R. Figueiredo, V. Labatut (2020), Multiplicity and Diversity: Analyzing the Optimal Solution Space of the Correlation Clustering Problem on Complete Signed Graphs, Journal of Complex Networks, DOI: 10.1093/comnet/cnaa025.
The source code is accessible here: https://github.com/CompNet/Sosocc

This dataset contains:
* Plot files used in the article
* Input signed networks
* All optimal solutions (i.e. optimal solution space) of the corresponding networks
* Evaluation files

# PLOT FILES
* Figure1.zip: Figures showing that there might be many distinct optimal solutions of a small-sized network.
* Figure2.zip: Figures showing that distinct optimal solutions of a given network might be partition-wise very similar or different.
* Figure4: All Results.zip: For space considerations, Figure 4 in the article contains only a few plots of the results. This zip file contains all plots, organized by the values of l0. In each l0 folder, the results are shown from three different perspectives:
--- Detected Imbalance Percentage vs Graph Order (i.e. number of vertices)
--- Prop mispl vs Graph order
--- Graph order vs Prop mispl
* workflow.pdf: The workflow of the methodology used in the article.
* Syrian network With All Solutions.pdf: Syrian network (on top) with core part information shown through node colors, and its optimal solutions, in which node colors represent partition information (on bottom).

# NETWORKS
All networks are in Input Signed Networks.tar.gz.
Networks are generated through a simple random model (available at https://github.com/CompNet/SignedBenchmark) designed to produce complete (or incomplete) unweighted networks with built-in modular structure. There are 3 parameters used for the generation:
- number of nodes (n)
- initial number of modules (l0)
- proportion of misplaced links, i.e. proportion of frustrated links (qm)

Inside Input Signed Networks.tar.gz:
NETWORKS
|_n=NB-NODE_l0=INIT_NB_MODULE_dens=1.0000
....|_propMispl=PROP_MISPL
........|_propNeg=PROP_NEG
............|_network=NETWORK_NO

- The first hierarchy => the folders are named as follows: n=NB-NODE_l0=INIT-NB-MODULE_dens=1.0000
  The number of nodes, the initial number of modules and the network density are given. The network density is always 1, since we treat only complete signed networks.
- The second hierarchy => the folders are named as follows: propMispl=PROP_MISPL
  The proportion of misplaced links is given.
- The third hierarchy => the folders are named as follows: propNeg=PROP_NEG
  The proportion of negative links (qn) is specified. qn changes depending on n and l0. Since only complete signed networks are studied, this parameter is automatically computed from the other input parameters.
- The fourth hierarchy => the folders are named as follows: network=NETWORK_NO
  The network numbers are shown.

At the lowest level, there are three file formats describing the same network content: GraphML (.graphml), Pajek NET (.net) and .G format.

# PARTITIONS
All partition results are in Partition Results.tar.gz. Note that all optimal partitions of a signed network are obtained through an exact partitioning method. The source code is accessible here: https://github.com/arinik9/ExCC

Inside Partition Results.tar.gz:
PARTITIONS
|_n=NB-NODE_l0=INIT_NB_MODULE_dens=1.0000
....|_propMispl=PROP_MISPL
........|_propNeg=PROP_NEG
............|_network=NETWORK_NO
................|_"ExCC-all"
....................|_"signed-unweighted"

- The first hierarchy => the folders are named as follows: n=NB-NODE_l0=INIT-NB-MODULE_dens=1.0000
- The second hierarchy => the folders are named as follows: propMispl=PROP_MISPL
- The third hierarchy => the folders are named as follows: propNeg=PROP_NEG
- The fourth hierarchy => the folders are named as follows: network=NETWORK_NO
- The fifth hierarchy => the folders are named as follows: "ExCC-all"
  The name of the partitioning method is shown. Since an exact partitioning method is used to obtain all distinct optimal solutions, it is named "ExCC-all".
- The sixth hierarchy => the folders are named as follows: "signed-unweighted"
  The type of signed network is shown: signed and unweighted.

At the lowest level are the partition result files, named as follows: membership.txt. Note that partition result numbering starts from zero.

# EVALUATIONS
Evaluation results related to our plots are in Evaluation Results.tar.gz. Note that the hierarchy of this folder is the same as that of 'Partitions'.

Inside Evaluation Results.tar.gz:
- Best-k-for-kmedoids.csv: It contains three columns: 1) the number of solution classes via kmedoids, 2) the best Silhouette score, 3) the best clustering in terms of Silhouette score, which represents solution classes.
- class-core-part-size-tresh=1.00.csv: It indicates the proportion of core part size for each solution class.
- exec-time.csv: It indicates the execution time in seconds.
- imbalance.csv: It contains the imbalance information as 1) count and 2) percentage.
- nb-solution.csv: It indicates the total number of solutions.
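The folder naming scheme described above encodes the generation parameters in the path itself, so they can be recovered with a small parser. The Python sketch below is one way to do that, not code from the authors' repositories. It assumes that NB-NODE, INIT_NB_MODULE and NETWORK_NO are integers and that the proportions are plain decimals; the function name parse_network_path and the sample path values are hypothetical.

```python
import re

# Four folder levels: n/l0/dens, then propMispl, then propNeg, then network.
# ASSUMPTION: integer counts and decimal proportions; adjust the character
# classes if the archive uses another number format.
PATH_PATTERN = re.compile(
    r"n=(?P<n>\d+)_l0=(?P<l0>\d+)_dens=(?P<dens>[\d.]+)"
    r"/propMispl=(?P<prop_mispl>[\d.]+)"
    r"/propNeg=(?P<prop_neg>[\d.]+)"
    r"/network=(?P<network>\d+)"
)


def parse_network_path(path: str) -> dict:
    """Extract the generation parameters encoded in a network folder path."""
    match = PATH_PATTERN.search(path.replace("\\", "/"))
    if match is None:
        raise ValueError(f"path does not follow the expected hierarchy: {path}")
    return {
        "n": int(match["n"]),                      # number of nodes
        "l0": int(match["l0"]),                    # initial number of modules
        "dens": float(match["dens"]),              # density (1.0 for complete graphs)
        "prop_mispl": float(match["prop_mispl"]),  # proportion of misplaced links (qm)
        "prop_neg": float(match["prop_neg"]),      # proportion of negative links (qn)
        "network": int(match["network"]),          # network number
    }


# Made-up path for illustration only; the values are not taken from the archive.
example = "NETWORKS/n=20_l0=3_dens=1.0000/propMispl=0.2000/propNeg=0.7000/network=1"
print(parse_network_path(example))
```

Because the pattern is searched rather than anchored, the same function also accepts paths under PARTITIONS that continue with the "ExCC-all" and "signed-unweighted" levels.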
Number of PT/INR results anlayzed in different age groups.
plos.figshare.com
xls
Updated Jun 13, 2023
Muhammad Shariq Shaikh; Sibtain Ahmed (2023). Number of PT/INR results anlayzed in different age groups. [Dataset]. http://doi.org/10.1371/journal.pone.0276884.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0276884.t001
Dataset updated
Jun 13, 2023
Dataset provided by
PLOS ONE
Authors
Muhammad Shariq Shaikh; Sibtain Ahmed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of PT/INR results anlayzed in different age groups.
Yanhong Luo; Zhi Li; Husheng Guo; Hongyan Cao; Chunying Song; Xingping Guo; Yanbo Zhang (2023). Summary of model performance (median (Q1-Q3)) [Dataset]. http://doi.org/10.1371/journal.pone.0177811.t003

2 scholarly articles cite this dataset (View in Google Scholar)


