13 datasets found
  1. Data from: Data-Driven Approach Considering Imbalance in Data Sets and...

    • acs.figshare.com
    zip
    Updated Apr 10, 2025
    Wataru Takahara; Ryuto Baba; Yosuke Harashima; Tomoaki Takayama; Shogo Takasuka; Yuichi Yamaguchi; Akihiko Kudo; Mikiya Fujii (2025). Data-Driven Approach Considering Imbalance in Data Sets and Experimental Conditions for Exploration of Photocatalysts [Dataset]. http://doi.org/10.1021/acsomega.4c06997.s001
    Available download formats: zip
    Dataset updated: Apr 10, 2025
    Dataset provided by: ACS Publications
    Authors: Wataru Takahara; Ryuto Baba; Yosuke Harashima; Tomoaki Takayama; Shogo Takasuka; Yuichi Yamaguchi; Akihiko Kudo; Mikiya Fujii
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically.

    Description

    In the field of data-driven material development, an imbalance in data sets where data points are concentrated in certain regions often causes difficulties in building regression models when machine learning methods are applied. One example of inorganic functional materials facing such difficulties is photocatalysts. Therefore, advanced data-driven approaches are expected to help efficiently develop novel photocatalytic materials even if an imbalance exists in data sets. We propose a two-stage machine learning model aimed at handling imbalanced data sets without data thinning. In this study, we used two types of data sets that exhibit the imbalance: the Materials Project data set (openly shared due to its public domain data) and the in-house metal-sulfide photocatalyst data set (not openly shared due to the confidentiality of experimental data). This two-stage machine learning model consists of the following two parts: the first regression model, which predicts the target quantitatively, and the second classification model, which determines the reliability of the values predicted by the first regression model. We also propose a search scheme for variables related to the experimental conditions based on the proposed two-stage machine learning model. This scheme is designed for photocatalyst exploration, taking experimental conditions into account as the optimal set of variables for these conditions is unknown. The proposed two-stage machine learning model improves the prediction accuracy of the target compared with that of the one-stage model.

  2. Regression of articles’ imbalance (square-root-transformed) on relevant...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Jens Jirschitzka; Joachim Kimmerle; Iassen Halatchliyski; Julia Hancke; Detmar Meurers; Ulrike Cress (2023). Regression of articles’ imbalance (square-root-transformed) on relevant predictors and their interactions with the dummy-coded direction of article polarity (estimates in parentheses result if the dummy variable gets a value of zero for conventional perspectives). [Dataset]. http://doi.org/10.1371/journal.pone.0178985.t004
    Available download formats: xls
    Dataset updated: May 31, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Jens Jirschitzka; Joachim Kimmerle; Iassen Halatchliyski; Julia Hancke; Detmar Meurers; Ulrike Cress
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Regression of articles’ imbalance (square-root-transformed) on relevant predictors and their interactions with the dummy-coded direction of article polarity (estimates in parentheses result if the dummy variable gets a value of zero for conventional perspectives).

  3. Sample size (n) of the full dataset generated under each class-imbalance...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Khurram Nadeem; Mehdi-Abderrahman Jabri (2023). Sample size (n) of the full dataset generated under each class-imbalance ratio (IR) to achieve a target balanced sample size (nb). [Dataset]. http://doi.org/10.1371/journal.pone.0280258.t002
    Available download formats: xls
    Dataset updated: Jun 21, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Khurram Nadeem; Mehdi-Abderrahman Jabri
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Sample size (n) of the full dataset generated under each class-imbalance ratio (IR) to achieve a target balanced sample size (nb).

  4. Student Performance Dataset

    • cubig.ai
    zip
    Updated May 28, 2025
    CUBIG (2025). Student Performance Dataset [Dataset]. https://cubig.ai/store/products/358/student-performance-dataset
    Available download formats: zip
    Dataset updated: May 28, 2025
    Dataset authored and provided by: CUBIG
    License: https://cubig.ai/store/terms-of-service
    Measurement technique: Synthetic data generation using AI techniques for model training; privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction
    • The Student Performance Dataset is a tabular survey of secondary school mathematics students. It covers student demographics, family environment, parents' education and occupation, health, family relationships, and grades.

    2) Data Utilization
    (1) Characteristics of the Student Performance Dataset:
    • Each row contains 33 attributes, including school ID, gender, age, family size, parents' educational level and occupation, family relationships, health status, and grades.
    • It is suitable for a variety of data analysis and prediction exercises on the target variable Grade, including regression analysis and analysis of imbalance in categorical variables.
    (2) Uses of the Student Performance Dataset:
    • Predicting academic achievement and analyzing influencing factors: the data can be used to analyze how factors such as a student's background, family environment, and parental characteristics affect grades, and to develop a grade prediction model.
    • Establishing educational policies and customized support strategies: based on student-specific characteristics and grade data, it can inform educational policies such as closing educational gaps, supporting vulnerable student groups, and providing customized learning guidance.

  5. Regression model Input variables and resulting regression coefficients by...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 1, 2023
    Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons (2023). Regression model Input variables and resulting regression coefficients by cluster for the calibration dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0207120.t004
    Available download formats: xls
    Dataset updated: Jun 1, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Regression model Input variables and resulting regression coefficients by cluster for the calibration dataset.

  6. Using time series analysis approaches for improved prediction of pain...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 1, 2023
    Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons (2023). Using time series analysis approaches for improved prediction of pain outcomes in subgroups of patients with painful diabetic peripheral neuropathy [Dataset]. http://doi.org/10.1371/journal.pone.0207120
    Available download formats: docx
    Dataset updated: Jun 1, 2023
    Dataset provided by: PLOS ONE
    Authors: Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Prior work applied hierarchical clustering, coarsened exact matching (CEM), time series regressions with lagged variables as inputs, and microsimulation to data from three randomized clinical trials (RCTs) and a large German observational study (OS) to predict pregabalin pain reduction outcomes for patients with painful diabetic peripheral neuropathy. Here, data were added from six RCTs to reduce covariate bias of the same OS and improve accuracy and/or increase the variety of patients for pain response prediction. Using hierarchical cluster analysis and CEM, a matched dataset was created from the OS (N = 2642) and nine total RCTs (N = 1320). Using a maximum likelihood method, we estimated weekly pain scores for pregabalin-treated patients for each cluster (matched dataset); the models were validated with RCT data that did not match with OS data. We predicted novel ‘virtual’ patient pain scores over time using simulations including instance-based machine learning techniques to assign novel patients to a cluster, then applying cluster-specific regressions to predict pain response trajectories. Six clusters were identified according to baseline variables (gender, age, insulin use, body mass index, depression history, pregabalin monotherapy, prior gabapentin, pain score, and pain-related sleep interference score). CEM yielded 1766 patients (matched dataset) having lower covariate imbalances. Regression models for pain performed well (adjusted R-squared 0.90–0.93; root mean square errors 0.41–0.48). Simulations showed positive predictive values for achieving >50% and >30% change-from-baseline pain score improvements (range 68.6–83.8% and 86.5–93.9%, respectively). Using more RCTs (nine vs. the earlier three) enabled matching of 46.7% more patients in the OS dataset, with substantially reduced global imbalance vs. not matching. This larger RCT pool covered 66.8% of possible patient characteristic combinations (vs. 25.0% with three original RCTs) and made prediction possible for a broader spectrum of patients.

    Trial Registration: www.clinicaltrials.gov (as applicable): NCT00156078, NCT00159679, NCT00143156, NCT00553475.

  7. Statistical comparison of clusters within the matched dataset, within the...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 30, 2023
    Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons (2023). Statistical comparison of clusters within the matched dataset, within the validation dataset, and between the matched and validation datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0207120.t003
    Available download formats: xls
    Dataset updated: May 30, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Statistical comparison of clusters within the matched dataset, within the validation dataset, and between the matched and validation datasets.

  8. Summary of patients from RCTs included in virtual Lab 2.0 by maintenance...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons (2023). Summary of patients from RCTs included in virtual Lab 2.0 by maintenance dose. [Dataset]. http://doi.org/10.1371/journal.pone.0207120.t001
    Available download formats: xls
    Dataset updated: May 31, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Joe Alexander Jr.; Roger A. Edwards; Marina Brodsky; Luigi Manca; Roberto Grugni; Alberto Savoldelli; Gianluca Bonfanti; Birol Emir; Ed Whalen; Steve Watt; Bruce Parsons
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Summary of patients from RCTs included in virtual Lab 2.0 by maintenance dose.

  9. Precision medicine knowledge regression model results.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 4, 2023
    Rohini Chakravarthy; Sarah C. Stallings; Michael Williams; Megan Hollister; Mario Davidson; Juan Canedo; Consuelo H. Wilkins (2023). Precision medicine knowledge regression model results. [Dataset]. http://doi.org/10.1371/journal.pone.0234833.t003
    Available download formats: xls
    Dataset updated: Jun 4, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Rohini Chakravarthy; Sarah C. Stallings; Michael Williams; Megan Hollister; Mario Davidson; Juan Canedo; Consuelo H. Wilkins
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Precision medicine knowledge regression model results.

  10. Trust regression models.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 2, 2023
    Rohini Chakravarthy; Sarah C. Stallings; Michael Williams; Megan Hollister; Mario Davidson; Juan Canedo; Consuelo H. Wilkins (2023). Trust regression models. [Dataset]. http://doi.org/10.1371/journal.pone.0234833.t004
    Available download formats: xls
    Dataset updated: Jun 2, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Rohini Chakravarthy; Sarah C. Stallings; Michael Williams; Megan Hollister; Mario Davidson; Juan Canedo; Consuelo H. Wilkins
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Trust regression models.

  11. GMM regression of Eq (38) (dependent variable: lnrpgdp).

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Yimin Chen; Yulin Liu; Xin Fang (2023). GMM regression of Eq (38) (dependent variable: lnrpgdp). [Dataset]. http://doi.org/10.1371/journal.pone.0257456.t005
    Available download formats: xls
    Dataset updated: Jun 9, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Yimin Chen; Yulin Liu; Xin Fang
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    GMM regression of Eq (38) (dependent variable: lnrpgdp).

  12. Two-stage least squares (2SLS) instrumental variable method regression...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jan 24, 2025
    Lan Mao (2025). Two-stage least squares (2SLS) instrumental variable method regression results. [Dataset]. http://doi.org/10.1371/journal.pone.0317537.t004
    Available download formats: xls
    Dataset updated: Jan 24, 2025
    Dataset provided by: PLOS ONE
    Authors: Lan Mao
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Two-stage least squares (2SLS) instrumental variable method regression results.

  13. Variables and data resources in the study.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Mandana Rezaeiahari; Clare C. Brown; Mir M. Ali; Jyotishka Datta; J. Mick Tilford (2023). Variables and data resources in the study. [Dataset]. http://doi.org/10.1371/journal.pone.0259258.t001
    Available download formats: xls
    Dataset updated: Jun 9, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Mandana Rezaeiahari; Clare C. Brown; Mir M. Ali; Jyotishka Datta; J. Mick Tilford
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Variables and data resources in the study.

