14 datasets found
  1. Replication Data for: "A Topic-based Segmentation Model for Identifying...

    • search.dataone.org
    Updated Sep 25, 2024
    Cite
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
    Description

    We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package so that researchers or practitioners can apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets. First, we provide R code to replicate the illustrative simulation study (file 1). Second, we provide the user-friendly R package with a very simple example script (file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R). Third, we provide code and instructions to replicate the empirical studies of customer-level and restaurant-level segmentation with Yelp reviews data (files 3-a, 3-b, 4-a, 4-b). Note that, owing to Yelp's dataset terms of use and data-size restrictions, we instead provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide code and datasets to replicate the empirical study with professor ratings reviews data (file 5). See the description text and comments of each file for details.

    [A guide on how to use the code to reproduce each study in the paper]

    1. Full codes for replicating Illustrative simulation study.txt [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), variable-selection indicators, and segment memberships, you will obtain the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

    3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

    3-b. Instruction for replicating Customer-level Segmentation analysis.txt [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will obtain estimated coefficients (posterior means of coefficients), variable-selection indicators, and segment memberships. Computing time is approximately 3 to 4 hours.

    4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

    4-b. Instructions for replicating restaurant-level segmentation analysis.txt [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will obtain estimated coefficients (posterior means of coefficients), variable-selection indicators, and segment memberships. Computing time is approximately 10 to 12 hours.

    [Guidelines for running benchmark models in Table 6]

    Unsupervised topic model: 'topicmodels' package in R. After determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package, then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors in a regression for prediction.

    Hierarchical topic model (HDP): 'gensimr' R package; the 'model_hdp' function identifies topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

    Supervised topic model: 'lda' R package; 'slda.em' function for training and 'slda.predict' for prediction.

    Aggregate regression: the default 'lm' function in R.

    Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a chosen number of segments (e.g., 3 segments in this study); then, with the estimated coefficients and memberships, predict the dependent variable for each segment.

    Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo's (2012) package. Run the Kim et al. (2012) model with a chosen number of segments (e.g., 3 segments in this study); then, with the estimated coefficients and memberships, predict the dependent variable for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

    5. Instructions for replicating Professor ratings review study.txt [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

    [A list of the versions of R, packages, and computer...
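The prediction step shared by the latent class benchmarks above (assign each observation to its most probable segment, then predict with that segment's coefficients) can be sketched as follows. This is an illustrative Python sketch, not the authors' R code; the memberships and coefficients are made-up examples.

```python
# Illustrative per-segment prediction for a latent class regression
# benchmark: each observation is hard-assigned to its most probable
# segment, then predicted with that segment's linear coefficients.
# All numbers below are hypothetical.

def predict_by_segment(X, memberships, coefs):
    """X: list of feature vectors; memberships: per-observation segment
    probabilities; coefs: per-segment (intercept, weights)."""
    preds = []
    for x, probs in zip(X, memberships):
        seg = max(range(len(probs)), key=lambda k: probs[k])  # hard assignment
        intercept, weights = coefs[seg]
        preds.append(intercept + sum(w * xi for w, xi in zip(weights, x)))
    return preds

X = [[1.0, 2.0], [0.5, -1.0]]
memberships = [[0.9, 0.1], [0.2, 0.8]]          # P(segment | observation)
coefs = [(0.0, [1.0, 1.0]), (2.0, [0.0, 1.0])]  # per-segment (b0, b)
print(predict_by_segment(X, memberships, coefs))  # → [3.0, 1.0]
```

The same pattern applies whether the coefficients come from 'flexmix' or from the variable-selection model; only the estimation step differs.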

  2. Data from: Dataset for Vector space model and the usage patterns of...

    • figshare.com
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg; Karlina Denistia; Simon Musgrave (2023). Dataset for Vector space model and the usage patterns of Indonesian denominal verbs [Dataset]. http://doi.org/10.6084/m9.figshare.8187155.v1
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Gede Primahadi Wijaya Rajeg; Karlina Denistia; Simon Musgrave
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Preface
    This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).

    How to cite the dataset
    If you use, adapt, and/or modify any of the datasets in this repository for your research or teaching purposes (except for malindo_dbase, see below), please cite:
    Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155
    Alternatively, click the dark pink Cite button to browse different citation styles (the default is DataCite).
    The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository), so please also cite their work if you use it for your research:
    Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. In Kiyoaki Shirai (ed.), Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.
    A tutorial on how to use the data, together with the R Markdown Notebook for the analyses, is available on GitHub and figshare:
    Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. https://doi.org/10.6084/m9.figshare.9970205

    Dataset description
    1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
    2. Files beginning with ngramexmpl_... are data for the n-grams (i.e., word sequences) of the verbs discussed in the paper, in tab-separated format.
    3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in plain text and R dataset [.rds] formats). Information on the corpus file and the sentence number in which each verb is found is included.
    4. me_parsed_nountaggedbase (in three file formats) contains the database of me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
    5. wordcount_leipzig_allcorpus (in three file formats) contains the size of each corpus file used in the paper and from which the vector space model was built.
    6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist was built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the three prefixes using the following regular expressions:
    - For me-: ^(?i)(me)([a-z-]{3,})$
    - For di-: ^(?i)(di)([a-z-]{3,})$
    - For ter-: ^(?i)(ter)([a-z-]{3,})$
    7. malindo_dbase is the MALINDO Morphological Dictionary (see above).

    References
    Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
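The extraction step behind the wordlist can be sketched with Python's re module. The original work applies these patterns in R; here re.IGNORECASE stands in for the inline (?i) flag (which newer Python versions reject mid-pattern), and the token list is invented for illustration.

```python
import re

# Group lowercased tokens by the three prefix patterns from the text.
# re.IGNORECASE replaces the original inline (?i) flag.
PREFIX_PATTERNS = {
    "me-": re.compile(r"^(me)([a-z-]{3,})$", re.IGNORECASE),
    "di-": re.compile(r"^(di)([a-z-]{3,})$", re.IGNORECASE),
    "ter-": re.compile(r"^(ter)([a-z-]{3,})$", re.IGNORECASE),
}

def extract_prefixed(tokens):
    """Lowercase tokens and collect those matching each prefix pattern."""
    hits = {prefix: [] for prefix in PREFIX_PATTERNS}
    for tok in tokens:
        tok = tok.lower()
        for prefix, pat in PREFIX_PATTERNS.items():
            if pat.match(tok):
                hits[prefix].append(tok)
    return hits

tokens = ["Membeli", "dibeli", "terbeli", "beli", "mea"]
print(extract_prefixed(tokens))
# → {'me-': ['membeli'], 'di-': ['dibeli'], 'ter-': ['terbeli']}
```

Note how the {3,} quantifier excludes short false matches such as "mea", matching the intent of the original patterns.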

  3. R code for analysis of Irukandji data of the GBR (NESP TWQ 2.2.3, CSIRO)

    • researchdata.edu.au
    Updated 2019
    Cite
    Richardson, Anthony J, Prof (2019). R code for analysis of Irukandji data of the GBR (NESP TWQ 2.2.3, CSIRO) [Dataset]. https://researchdata.edu.au/r-code-analysis-223-csiro/1360980
    Available download formats: bin
    Dataset updated
    2019
    Dataset provided by
    eAtlas
    Authors
    Richardson, Anthony J, Prof
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1985 - Dec 31, 2016
    Area covered
    Great Barrier Reef
    Description

    This dataset presents the code written for the analysis and modelling for the Jellyfish Forecasting System for NESP TWQ Project 2.2.3. The Jellyfish Forecasting System (JFS) searches for robust statistical relationships between historical sting events (and observations) and local environmental conditions. These relationships are tested using data to quantify the underlying uncertainties. They then form the basis for forecasting risk levels associated with current environmental conditions.

    The development of the JFS modelling and analysis is supported by the Venomous Jellyfish Database (sting events and specimen samples – November 2018) (NESP 2.2.3, CSIRO) with corresponding analysis of wind fields and tidal heights along the Queensland coastline. The code has been calibrated and tested for the study focus regions including Cairns (Beach, Island, Reef), Townsville (Beach, Island+Reef) and Whitsundays (Beach, Island+Reef).

    The JFS uses the European Centre for Medium-Range Weather forecasting (ECMWF) wind fields from the ERA Interim, Daily product (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim). This daily product has global coverage at a spatial resolution of approximately 80km. However, only 11 locations off the Queensland coast were extracted covering the period 1-Jan-1985 to 31-Dec-2016. For the modelling, the data has been transformed into CSV files containing date, eastward wind (m/s) and northward wind (m/s), for each of the 11 geographical locations.

    Hourly tidal height was calculated from tidal harmonics supplied by the Bureau of Meteorology (http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml) using the XTide software (http://www.flaterco.com/xtide/). Hourly tidal heights have been calculated for 7 sites along the Queensland coast (Albany Island, Cairns, Cardwell, Cooktown, Fife, Grenville, Townsville) for the period 1-Jan-1985 to 31-Dec-2017. Data has been transformed into CSV files, one for each of the 7 sites. Columns correspond to number of days since 1-Jan 1990 and tidal height (m).

    Irukandji stings were then modelled using a generalised linear model (GLM). A GLM generalises ordinary linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (McCullagh & Nelder 1989). For each region, we used a GLM with the number of Irukandji stings per day as the response variable. The GLM had a Poisson error structure and a log link function (Crawley 2005). For the Poisson GLMs, we inferred absences when stings were not recorded in the data for a day. We consider that there was reasonably consistent sampling effort in the database since 1985, but very patchy prior to this date. It should be noted that Irukandji are very patchy in time; for example, there was a single sting record in 2017 despite considerable effort trying to find stings in that year. Although the database might miss small and localised Irukandji sting events, we believe it captures larger infestation events.

    We included six predictors in the models: Month, two wind variables, and three tidal variables. Month was a factor, arranged so that summer fell in the middle of the year (i.e., from June to May). The two wind variables were Speed and Direction. For each day within each region (Cairns, Townsville or Whitsundays), hourly wind speed and direction were used. We derived cumulative wind Speed and Direction working backwards from each day, with the current day being Day 1. We calculated cumulative winds from the current day (Day 1) to 14 days previously for every day in every Region and Area. To give greater weight to winds on more recent days, we used an inverse weighting in which day i receives weight 1/i. Thus, the Cumulative Speed over n days is given by:

    Cumulative Speed_n = (\sum_{i=1}^{n} Speed_i / i) / (\sum_{i=1}^{n} 1/i)

    For example, calculations for the 3-day cumulative wind speed are:

    (1/1 × Wind Day 1 + 1/2 × Wind Day 2 + 1/3 × Wind Day 3) / (1/1 + 1/2 + 1/3)
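A minimal Python sketch of this inverse-weighted cumulative speed (the analysis itself is in R; the wind values here are invented for illustration):

```python
# Inverse-weighted cumulative wind speed: day i gets weight 1/i,
# with speeds[0] being Day 1 (the current, most recent day).

def cumulative_speed(speeds):
    num = sum(s / i for i, s in enumerate(speeds, start=1))
    den = sum(1 / i for i in range(1, len(speeds) + 1))
    return num / den

# 3-day example matching the formula in the text:
# (1/1*10 + 1/2*6 + 1/3*3) / (1/1 + 1/2 + 1/3) = 14 / (11/6)
print(round(cumulative_speed([10.0, 6.0, 3.0]), 4))  # → 7.6364
```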

    Similarly, we calculated the cumulative weighted wind Direction using the formula:

    Cumulative Direction_n = (\sum_{i=1}^{n} Direction_i / i) / (\sum_{i=1}^{n} 1/i)

    We used circular statistics in the R package 'circular' to calculate the weighted cumulative mean, because direction 0° is the same as 360°. We initially used a smoother for this term in the model but, because of its non-linearity and the lack of winds from all directions, we found it better to use wind Direction as a factor with four levels (NW, NE, SE and SW). In some Regions and Areas, not all wind Directions were present.
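The weighted circular mean can be sketched in Python by averaging unit vectors, which is the idea behind circular statistics (this mirrors the concept rather than the 'circular' package's implementation; weights follow the 1/i day-weighting above, and the directions are invented):

```python
import math

# Weighted circular mean of wind directions: directions are averaged
# as unit vectors so that 0° and 360° coincide; day i gets weight 1/i.

def weighted_circular_mean(directions_deg):
    """directions_deg[0] is Day 1 (most recent)."""
    x = y = 0.0
    for i, d in enumerate(directions_deg, start=1):
        rad = math.radians(d)
        x += math.cos(rad) / i
        y += math.sin(rad) / i
    return math.degrees(math.atan2(y, x)) % 360.0

# Day 1 at 350° and Day 2 at 10°: the mean stays near north
# (pulled toward the heavier-weighted 350°), not the naive
# arithmetic mean of 180°.
print(round(weighted_circular_mean([350.0, 10.0]), 1))  # → 356.6
```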

    To assign each event to the tidal cycle, we used tidal data from the closest of the seven stations to calculate three tidal variables: (i) the tidal range each day (m); (ii) the tidal height (m); and (iii) whether the tide was incoming or outgoing. To estimate the three tidal variables, the time of day of the event was required. However, the Time of Day was only available for 780 observations, and the 291 missing observations were estimated assuming a random Time of Day, which will not influence the relationship but will keep these rows in the analysis. Tidal range was not significant in any models and will not be considered further.

    To focus on times when Irukandji were present, months when stings never occurred in an area/region were excluded from the analysis – this is generally the winter months. For model selection, we used Akaike Information Criterion (AIC), which is an estimate of the relative quality of models given the data, to choose the most parsimonious model. We thus do not talk about significant predictors, but important ones, consistent with information theoretic approaches.

    Limitations: It is important to note that while the presence of Irukandji is more likely on high risk days, the forecasting system should not be interpreted as predicting the presence of Irukandji or that stings will occur.

    Format:

    It is a text file with a .r extension, the default code format in R. This code runs on the csv datafile “VJD_records_EXTRACT_20180802_QLD.csv” that has latitude, longitude, date, and time of day for each Irukandji sting on the GBR. A subset of these data have been made publicly available through eAtlas, but not all data could be made publicly available because of permission issues. For more information about data permissions, please contact Dr Lisa Gershwin (lisa.gershwin@stingeradvisor.com).

    Data Location:

    This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.2.3_Jellyfish-early-warning\data\ and https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning

  4. Input Files and WRTDS Model Output for the two major tributaries of Lake...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Input Files and WRTDS Model Output for the two major tributaries of Lake Koocanusa: Water Quality [Dataset]. https://catalog.data.gov/dataset/input-files-and-wrtds-model-output-for-the-two-major-tributaries-of-lake-koocanusa-water-q
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Lake Koocanusa
    Description

    Canadian discrete water quality data and daily streamflow records were evaluated using the Weighted Regressions on Time, Discharge, and Season (WRTDS) model implemented in the EGRET R package (Hirsch et al. 2010, Hirsch and De Cicco 2015). Models were used to estimate solute loads and evaluate trends for three constituents of interest (selenium, nitrogen, and sulfate). Six models were generated: one for each constituent in each of the two major tributaries to Lake Koocanusa, the Kootenay River at Fenwick (BC08NG0009) and the Elk River above Highway 93 near Elko (BC08NK0003). Data were downloaded from the British Columbia Water Tool (https://kwt.bcwatertool.ca/surface-water-quality, https://kwt.bcwatertool.ca/streamflow) and Environment and Climate Change Canada (https://open.canada.ca/data/en/dataset/c2adcb27-6d7e-4e97-b546-b8ee3d586aa4/resource/7bb8d1ff-f446-494f-8f3d-ad252162eef5?inner_span=True). This data release consists of two input data files and one output file from the EGRET model estimation (an eList, which contains the WRTDS model) for each site and constituent. The input datasets include a daily discharge file and a measured concentration file. The period of record for the water quality data varies among constituents and sites, and each output file's time period aligns with its input files accordingly. Nitrate in the Elk River at Highway 93 has the longest period of record, from 1979 to 2022. Water quality sampling at the Fenwick station was discontinued in 2019, so all models for the Kootenay end after 2019. This data release also contains mass removal data provided by Teck Coal Limited, which were incorporated into a sub-analysis that used the WRTDS selenium model for the Elk River. This child item contains only the water quality files. The WRTDS model was run at a daily time step.

    Model performance evaluations, including a visual assessment of model fit and residuals and bias correction factors, were completed. Model output for each parameter at each site (6 total) is published here as eLists (.rds files). The format of each eList is standardized per EGRET processing; see Hirsch and De Cicco (2015) for a description of these files. WRTDS_Kalman estimates can also be obtained by running additional functions on the published eLists; to prevent redundancy they were excluded from this output. For the Kalman models, nitrate used a rho of 0.95 while the other models used the default (0.9).

    Citations: Hirsch, R.M., and De Cicco, L.A., 2015, User guide to Exploration and Graphics for RivEr Trends (EGRET) and dataRetrieval—R packages for hydrologic data (version 2.0, February 2015): U.S. Geological Survey Techniques and Methods book 4, chap. A10, 93 p., http://dx.doi.org/10.3133/tm4A10. Hirsch, R.M., Moyer, D.L., and Archfield, S.A., 2010, Weighted Regressions on Time, Discharge, and Season (WRTDS), With an Application to Chesapeake Bay River Inputs: Journal of the American Water Resources Association (JAWRA), v. 46, no. 5, p. 857-880, http://dx.doi.org/10.1111/j.1752-1688.2010.00482.x.
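The core WRTDS idea referenced above is a locally weighted regression in which each calibration sample is weighted by its distance from the estimation point in time, log-discharge, and season (Hirsch et al. 2010). The following is a conceptual Python sketch of that weighting, not the EGRET implementation; the half-window widths are assumptions chosen for illustration.

```python
# Conceptual sketch of WRTDS-style sample weighting: the weight is the
# product of tricube kernels over distance in time (years), in
# log-discharge, and in season (fraction of a year). Half-window
# widths are illustrative assumptions, not EGRET's settings.

def tricube(d, h):
    """Tricube kernel: 1 at d = 0, falling smoothly to 0 at |d| >= h."""
    u = abs(d) / h
    return (1 - u**3)**3 if u < 1 else 0.0

def wrtds_weight(dt_years, dlogq, dseason_years,
                 h_time=7.0, h_logq=2.0, h_season=0.5):
    return (tricube(dt_years, h_time)
            * tricube(dlogq, h_logq)
            * tricube(dseason_years, h_season))

print(wrtds_weight(0.0, 0.0, 0.0))  # → 1.0 (sample at the estimation point)
print(wrtds_weight(1.0, 0.5, 0.1))  # partial weight, between 0 and 1
```

Samples outside any of the three windows receive zero weight, so each fitted surface point is driven by nearby observations in all three dimensions.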

  5. Data of the REST-meta-MDD Project from DIRECT Consortium

    • scidb.cn
    Updated Jun 20, 2022
    Cite
    Chao-Gan Yan; Xiao Chen; Le Li; Francisco Xavier Castellanos; Tong-Jian Bai; Qi-Jing Bo; Jun Cao; Guan-Mao Chen; Ning-Xuan Chen; Wei Chen; Chang Cheng; Yu-Qi Cheng; Xi-Long Cui; Jia Duan; Yi-Ru Fang; Qi-Yong Gong; Wen-Bin Guo; Zheng-Hua Hou; Lan Hu; Li Kuang; Feng Li; Tao Li; Yan-Song Liu; Zhe-Ning Liu; Yi-Cheng Long; Qing-Hua Luo; Hua-Qing Meng; Dai-Hui Peng; Hai-Tang Qiu; Jiang Qiu; Yue-Di Shen; Yu-Shu Shi; Yan-Qing Tang; Chuan-Yue Wang; Fei Wang; Kai Wang; Li Wang; Xiang Wang; Ying Wang; Xiao-Ping Wu; Xin-Ran Wu; Chun-Ming Xie; Guang-Rong Xie; Hai-Yan Xie; Peng Xie; Xiu-Feng Xu; Hong Yang; Jian Yang; Jia-Shu Yao; Shu-Qiao Yao; Ying-Ying Yin; Yong-Gui Yuan; Ai-Xia Zhang; Hong Zhang; Ke-Rang Zhang; Lei Zhang; Zhi-Jun Zhang; Ru-Bai Zhou; Yi-Ting Zhou; Jun-Juan Zhu; Chao-Jie Zou; Tian-Mei Si; Xi-Nian Zuo; Jing-Ping Zhao; Yu-Feng Zang (2022). Data of the REST-meta-MDD Project from DIRECT Consortium [Dataset]. http://doi.org/10.57760/sciencedb.o00115.00013
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Chao-Gan Yan; Xiao Chen; Le Li; Francisco Xavier Castellanos; Tong-Jian Bai; Qi-Jing Bo; Jun Cao; Guan-Mao Chen; Ning-Xuan Chen; Wei Chen; Chang Cheng; Yu-Qi Cheng; Xi-Long Cui; Jia Duan; Yi-Ru Fang; Qi-Yong Gong; Wen-Bin Guo; Zheng-Hua Hou; Lan Hu; Li Kuang; Feng Li; Tao Li; Yan-Song Liu; Zhe-Ning Liu; Yi-Cheng Long; Qing-Hua Luo; Hua-Qing Meng; Dai-Hui Peng; Hai-Tang Qiu; Jiang Qiu; Yue-Di Shen; Yu-Shu Shi; Yan-Qing Tang; Chuan-Yue Wang; Fei Wang; Kai Wang; Li Wang; Xiang Wang; Ying Wang; Xiao-Ping Wu; Xin-Ran Wu; Chun-Ming Xie; Guang-Rong Xie; Hai-Yan Xie; Peng Xie; Xiu-Feng Xu; Hong Yang; Jian Yang; Jia-Shu Yao; Shu-Qiao Yao; Ying-Ying Yin; Yong-Gui Yuan; Ai-Xia Zhang; Hong Zhang; Ke-Rang Zhang; Lei Zhang; Zhi-Jun Zhang; Ru-Bai Zhou; Yi-Ting Zhou; Jun-Juan Zhu; Chao-Jie Zou; Tian-Mei Si; Xi-Nian Zuo; Jing-Ping Zhao; Yu-Feng Zang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    (Note: Part of the content of this post was adapted from the original DIRECT Psychoradiology paper (https://academic.oup.com/psyrad/article/2/1/32/6604754) and the REST-meta-MDD PNAS paper (http://www.pnas.org/cgi/doi/10.1073/pnas.1900390116) under a CC BY-NC-ND license.)

    Major Depressive Disorder (MDD) is the second leading cause of health burden worldwide (1). Unfortunately, objective biomarkers to assist in diagnosis are still lacking, and current first-line treatments are only modestly effective (2, 3), reflecting our incomplete understanding of the pathophysiology of MDD. Characterizing the neurobiological basis of MDD promises to support the development of more effective diagnostic approaches and treatments.

    An increasingly used approach to revealing the neurobiological substrates of clinical conditions is resting-state functional magnetic resonance imaging (R-fMRI) (4). Despite intensive efforts to characterize the pathophysiology of MDD with R-fMRI, clinical imaging markers of diagnosis and predictors of treatment outcomes have yet to be identified. Previous reports have been inconsistent, sometimes contradictory, impeding the effort to translate them into clinical practice (5). One reason for inconsistent results is low statistical power from small sample sizes (6). Low-powered studies are more prone to produce false positive results, reducing the reproducibility of findings in a given field (7, 8). Of note, one recent study demonstrated that sample sizes in the thousands may be needed to identify reproducible brain-wide association findings (9), calling for larger datasets to boost effect size. Another reason could be high analytic flexibility (10). Recently, Botvinik-Nezer and colleagues (11) demonstrated the divergence in results when independent research teams applied different workflows to an identical fMRI dataset, highlighting the effects of "researcher degrees of freedom" (i.e., heterogeneity in (pre-)processing methods) in producing disparate fMRI findings.

    To address these critical issues, we initiated the Depression Imaging REsearch ConsorTium (DIRECT) in 2017. Through a series of meetings, a group of 17 participating hospitals in China agreed to establish the first project of the DIRECT consortium, the REST-meta-MDD Project, and to share 25 study cohorts, including R-fMRI data from 1300 MDD patients and 1128 normal controls. Based on prior work, a standardized preprocessing pipeline adapted from the Data Processing Assistant for Resting-State fMRI (DPARSF) (12, 13) was implemented at each local participating site to minimize heterogeneity in preprocessing methods. R-fMRI metrics can be vulnerable to physiological confounds such as head motion (14, 15). Based on our previous examination of the impact of head motion on R-fMRI functional connectomes (16) and other recent benchmarking studies (15, 17), DPARSF implements a participant-level regression model (the Friston-24 model) and group-level correction for mean frame displacement (FD) as the default setting.

    In the REST-meta-MDD Project of the DIRECT consortium, 25 research groups from 17 hospitals in China agreed to share final R-fMRI indices from patients with MDD and matched normal controls (see Supplementary Table; henceforth "site" refers to each cohort for convenience) from studies approved by local Institutional Review Boards. The consortium contributed 2428 previously collected datasets (1300 MDDs and 1128 NCs). On average, each site contributed 52.0±52.4 patients with MDD (range 13-282) and 45.1±46.9 NCs (range 6-251). Most MDD patients were female (826 vs. 474 males), as expected. The 562 patients with first-episode MDD included 318 first-episode drug-naïve (FEDN) patients and 160 scanned while receiving antidepressants (medication status unavailable for 84). Of the 282 patients with recurrent MDD, 121 were scanned while receiving antidepressants and 76 were not being treated with medication (medication status unavailable for 85). Episodicity (first or recurrent) and medication status were unavailable for 456 patients.

    To improve transparency and reproducibility, our analysis code has been openly shared at https://github.com/Chaogan-Yan/PaperScripts/tree/master/Yan_2019_PNAS. In addition, we would like to share the R-fMRI indices of the 1300 MDD patients and 1128 NCs through the R-fMRI Maps Project (http://rfmri.org/REST-meta-MDD). These data derivatives will allow replication, secondary analyses, and discovery efforts while protecting participant privacy and confidentiality.

    According to the agreement of the REST-meta-MDD consortium, there were two phases for sharing the brain imaging and phenotypic data of the 1300 MDD patients and 1128 NCs. 1) Phase 1: coordinated sharing, before January 1, 2020. To reduce conflicts among researchers, the consortium reviewed and coordinated the proposals submitted by interested researchers. Interested researchers first sent a letter of intent to rfmrilab@gmail.com; the consortium then sent all approved proposals to the applicant, who was to submit a new, innovative proposal avoiding conflict with the approved ones. If no conflict existed, the proposal was approved, entered the pool of approved proposals, and prevented future conflicts. 2) Phase 2: unrestricted sharing, after January 1, 2020, during which researchers can perform any analyses of interest while not violating ethics.

    The REST-meta-MDD data entered the unrestricted sharing phase on January 1, 2020. Please visit the Psychological Science Data Bank to download the data, then sign the Data Use Agreement and email the scanned signed copy to rfmrilab@gmail.com to receive the unzip password and phenotypic information.

    ACKNOWLEDGEMENTS
    This work was supported by the National Key R&D Program of China (2017YFC1309902), the National Natural Science Foundation of China (81671774, 81630031, 81471740 and 81371488), the Hundred Talents Program and the 13th Five-year Informatization Plan (XXH13505) of the Chinese Academy of Sciences, the Beijing Municipal Science & Technology Commission (Z161100000216152, Z171100000117016, Z161100002616023 and Z171100000117012), the Department of Science and Technology, Zhejiang Province (2015C03037), and the National Basic Research (973) Program (2015CB351702).

    REFERENCES
    1. A. J. Ferrari et al., Burden of Depressive Disorders by Country, Sex, Age, and Year: Findings from the Global Burden of Disease Study 2010. PLOS Medicine 10, e1001547 (2013).
    2. L. M. Williams et al., International Study to Predict Optimized Treatment for Depression (iSPOT-D), a randomized clinical trial: rationale and protocol. Trials 12, 4 (2011).
    3. S. J. Borowsky et al., Who is at risk of nondetection of mental health problems in primary care? J Gen Intern Med 15, 381-388 (2000).
    4. B. B. Biswal, Resting state fMRI: a personal history. Neuroimage 62, 938-944 (2012).
    5. C. G. Yan et al., Reduced default mode network functional connectivity in patients with recurrent major depressive disorder. Proc Natl Acad Sci U S A 116, 9078-9083 (2019).
    6. K. S. Button et al., Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14, 365-376 (2013).
    7. J. P. A. Ioannidis, Why Most Published Research Findings Are False. PLOS Medicine 2, e124 (2005).
    8. R. A. Poldrack et al., Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat Rev Neurosci 10.1038/nrn.2016.167 (2017).
    9. S. Marek et al., Reproducible brain-wide association studies require thousands of individuals. Nature 603, 654-660 (2022).
    10. J. Carp, On the Plurality of (Methodological) Worlds: Estimating the Analytic Flexibility of fMRI Experiments. Frontiers in Neuroscience 6, 149 (2012).
    11. R. Botvinik-Nezer et al., Variability in the analysis of a single neuroimaging dataset by many teams. Nature 10.1038/s41586-020-2314-9 (2020).
    12. C.-G. Yan, X.-D. Wang, X.-N. Zuo, Y.-F. Zang, DPABI: Data Processing & Analysis for (Resting-State) Brain Imaging. Neuroinformatics 14, 339-351 (2016).
    13. C.-G. Yan, Y.-F. Zang, DPARSF: A MATLAB Toolbox for "Pipeline" Data Analysis of Resting-State fMRI. Frontiers in Systems Neuroscience 4, 13 (2010).
    14. R. Ciric et al., Mitigating head motion artifact in functional connectivity MRI. Nature Protocols 13, 2801-2826 (2018).
    15. R. Ciric et al., Benchmarking of participant-level confound regression strategies for the control of motion artifact in studies of functional connectivity. NeuroImage 154, 174-187 (2017).
    16. C.-G. Yan et al., A comprehensive assessment of regional variation in the impact of head micromovements on functional connectomics. NeuroImage 76, 183-201 (2013).
    17. L. Parkes, B. Fulcher, M. Yücel, A. Fornito, An evaluation of the efficacy, reliability, and sensitivity of motion correction strategies for resting-state functional MRI. NeuroImage 171, 415-436 (2018).
    18. L. Wang et al., Interhemispheric functional connectivity and its relationships with clinical characteristics in major depressive disorder: a resting state fMRI study. PLoS One 8, e60191 (2013).
    19. L. Wang et al., The effects of antidepressant treatment on resting-state functional brain networks in patients with major depressive disorder. Hum Brain Mapp 36, 768-778 (2015).
    20. Y. Liu et al., Regional homogeneity associated with overgeneral autobiographical memory of first-episode treatment-naive patients with major depressive disorder in the orbitofrontal cortex: A resting-state fMRI study. J Affect Disord 209, 163-168 (2017).
    21. X. Zhu et al., Evidence of a dissociation pattern in resting-state default mode network connectivity in first-episode, treatment-naive major depression patients. Biological Psychiatry 71, 611-617 (2012).
    22. W. Guo et al., Abnormal default-mode

  6. Data from: Harmonized chronologies of a global late Quaternary pollen...

    • doi.pangaea.de
    • service.tib.eu
    html, tsv
    Updated Jun 28, 2021
    Cite
    Chenzhi Li; Alexander Postl; Thomas Böhmer; Andrew M Dolman; Ulrike Herzschuh (2021). Harmonized chronologies of a global late Quaternary pollen dataset (LegacyAge 1.0) [Dataset]. http://doi.org/10.1594/PANGAEA.933132
    Explore at:
    tsv, html (available download formats)
    Dataset updated
    Jun 28, 2021
    Dataset provided by
    PANGAEA
    Authors
    Chenzhi Li; Alexander Postl; Thomas Böhmer; Andrew M Dolman; Ulrike Herzschuh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 26, 1938 - Mar 18, 2014
    Area covered
    Variables measured
    Site, Type, LATITUDE, Continent, ELEVATION, LONGITUDE, Replicates, Description, Event label, Location type, and 6 more
    Description

    This dataset presents revised age models for a global set of taxonomically harmonized fossil pollen records. The age-depth models were established from mostly IntCal20-calibrated radiocarbon dates with a predefined parameter setting. 1032 sites are located in North America, 1075 in Europe, and 488 in Asia. In the Southern Hemisphere, there are 150 sites in South America, 54 in Africa, and 32 in the Indopacific region. The dates, mostly 14C, were retrieved from the Neotoma Paleoecology Database (https://www.neotomadb.org/), with additional data from Cao et al. (2020; https://doi.org/10.5194/essd-12-119-2020), Cao et al. (2013; https://doi.org/10.1016/j.revpalbo.2013.02.003) and our own collection. The age models were revised by applying a common approach across sites, i.e., the Bayesian age-depth modeling routine in the R-BACON software. […]
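    An age-depth model like those in this dataset ultimately serves to map a sample's depth in a core to a calibrated age. A minimal sketch of that lookup step (the dataset's own chronologies were built with the Bayesian R-BACON routine; this Python version only illustrates the idea, and the control points below are hypothetical):

    ```python
    import numpy as np

    # Hypothetical control points from a fitted age-depth model:
    # depths (cm) with modeled calibrated ages (cal yr BP).
    model_depths = np.array([0.0, 50.0, 120.0, 200.0, 310.0])
    model_ages = np.array([-30.0, 480.0, 1900.0, 4100.0, 7800.0])

    def age_at(depth_cm):
        """Linearly interpolate a calibrated age for a sampled depth."""
        return float(np.interp(depth_cm, model_depths, model_ages))

    ages = [age_at(d) for d in (25.0, 85.0, 160.0)]
    ```

    A Bayesian routine such as Bacon additionally yields uncertainty envelopes around each age; the linear lookup above only reproduces the point estimate step.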

  7. Data from: SIDER: an R package for predicting trophic discrimination factors...

    • search.dataone.org
    • datadryad.org
    Updated Apr 13, 2025
    Cite
    Kevin Healy; Thomas Guillerme; Seán B. A. Kelly; Richard Inger; Stuart Bearhop; Andrew L. Jackson (2025). SIDER: an R package for predicting trophic discrimination factors of consumers based on their ecology and phylogenetic relatedness [Dataset]. http://doi.org/10.5061/dryad.c6035
    Explore at:
    Dataset updated
    Apr 13, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kevin Healy; Thomas Guillerme; Seán B. A. Kelly; Richard Inger; Stuart Bearhop; Andrew L. Jackson
    Time period covered
    Jan 1, 2017
    Description

    Stable isotope mixing models (SIMMs) are an important tool for studying species' trophic ecology. These models are dependent on, and sensitive to, the choice of trophic discrimination factors (TDFs), which represent the offset in stable isotope delta values between a consumer and its food source when they are at equilibrium. Ideally, controlled feeding trials should be conducted to determine the appropriate TDF for each consumer, tissue type, food source, and isotope combination used in a study. In reality, however, this is often neither feasible nor practical. In the absence of species-specific information, many researchers either default to an average TDF value for the major taxonomic group of their consumer, or choose the nearest phylogenetic neighbour for which a TDF is available. Here, we present the SIDER package for R, which uses a phylogenetic regression model based on a compiled dataset to impute (estimate) the TDF of a consumer. We apply information on the tissue type and feeding ...

  8. Data_Sheet_1_autohrf-an R package for generating data-informed event models...

    • frontiersin.figshare.com
    • figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Nina Purg; Jure Demšar; Grega Repovš (2023). Data_Sheet_1_autohrf-an R package for generating data-informed event models for general linear modeling of task-based fMRI data.pdf [Dataset]. http://doi.org/10.3389/fnimg.2022.983324.s001
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Nina Purg; Jure Demšar; Grega Repovš
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The analysis of task-related fMRI data at the level of individual participants is commonly based on general linear modeling (GLM), which allows us to estimate the extent to which the BOLD signal can be explained by the task response predictors specified in the event model. The predictors are constructed by convolving the hypothesized time course of neural activity with an assumed hemodynamic response function (HRF). However, our assumptions about the components of brain activity, including their onset and duration, may be incorrect. Their timing may also differ across brain regions or from person to person, leading to inappropriate or suboptimal models, poor fit of the model to actual data, and invalid estimates of brain activity. Here, we present an approach that uses theoretically driven models of task response to define constraints from which the final model is computationally derived using actual fMRI data. Specifically, we developed autohrf, an R package that enables the evaluation and data-driven estimation of event models for GLM analysis. The highlight of the package is the automated parameter search that uses genetic algorithms to find the onset and duration of task predictors that result in the highest fitness of the GLM given the fMRI signal, under predefined constraints. We evaluated the usefulness of the autohrf package on two original datasets of task-related fMRI activity, a slow event-related spatial working memory study and a mixed state-item study using the flanker task, and on simulated slow event-related working memory data. Our results suggest that autohrf can be used to efficiently construct and evaluate better task-related brain activity models to gain a deeper understanding of BOLD task response and improve the validity of model estimates. Our study also highlights the sensitivity of fMRI analysis with GLM to precise event model specification and the need for model evaluation, especially in complex and overlapping event designs.
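    The predictor construction the description refers to, convolving a hypothesized neural time course with an assumed HRF, can be sketched as follows. This is an illustrative Python translation (autohrf itself is an R package), and the double-gamma parameters are common SPM-style defaults, an assumption here rather than autohrf's own code:

    ```python
    import numpy as np
    from scipy.stats import gamma

    TR = 1.0  # assumed repetition time (s)
    t = np.arange(0, 32, TR)  # 32 s of HRF support

    def double_gamma_hrf(t):
        # Canonical double-gamma HRF: a peak gamma minus a scaled undershoot
        # gamma. Shape parameters (6, 16) and the 1/6 undershoot ratio are
        # widely used defaults, assumed here for illustration.
        h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
        return h / h.sum()

    # Boxcar neural predictor: one 4 s event starting at 10 s in a 60 s run.
    n_scans = 60
    neural = np.zeros(n_scans)
    neural[10:14] = 1.0

    # Convolve and trim to the run length: this is the GLM task regressor.
    predictor = np.convolve(neural, double_gamma_hrf(t))[:n_scans]
    ```

    Shifting the event onset or duration shifts and reshapes the regressor, which is exactly the space autohrf's genetic search explores when it optimizes model fitness.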

  9. Data for: A modified Michaelis-Menten equation estimates growth from birth...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jan 22, 2024
    Cite
    Catherine Ley; William Walters (2024). Data for: A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the US [Dataset]. http://doi.org/10.5061/dryad.4j0zpc8jf
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    Max Planck Institute for Biology
    Stanford University School of Medicine
    Authors
    Catherine Ley; William Walters
    License

    CC0 1.0, https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life and compared this interpolation to several common interpolation methods and pediatric growth models. Methods: We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N=97) and then in a large, outpatient, pediatric sample (N=14,695). Results: The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR: 0.19; 90% < 0.43]; girls: 0.20 kg [IQR: 0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR: 0.53; 90% < 1.0]; girls: 0.91 cm [IQR: 0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Interpolation with this equation had comparable (for weight) or lower (for height) mean RMSE compared to the best-performing alternative models. Conclusions: A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0–36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.

    Methods
    Sources of data: Information on infants was ascertained from two sources: the STORK birth cohort and the STARR research registry. (1) Detailed methods for the STORK birth cohort have been described previously.
In brief, a multiethnic cohort of mothers and babies was followed from the second trimester of pregnancy to the babies’ third birthday. Healthy women aged 18–42 years with a single-fetus pregnancy were enrolled. Households were visited every four months until the baby’s third birthday (nine baby visits), with the weight of the baby at each visit recorded in pounds. Medical charts were abstracted for birth weight and length. (2) STARR (starr.stanford.edu) contains electronic medical record information from all pediatric and adult patients seen at Stanford Health Care (Stanford, CA). STARR staff provided anonymized information (weight, height and age in days for each visit through age three years; sex; race/ethnicity) for all babies during the period 03/2013–01/2022 followed from birth to at least 36 months of age with at least five well-baby care visits over the first year of life.
    Inclusion of data for modeling: All observed weight and height values were evaluated in kilograms (kg) and centimeters (cm), respectively. Any values assessed beyond 1,125 days (roughly 36 months) and values for height and weight deemed implausible by at least two reviewers (e.g., significant losses in height, or marked outliers for weight and height) were excluded from the analysis. Additionally, weights assessed between birth and 19 days were excluded. At least five observations across the 36-month period were required: babies with fewer than five weight or height values after the previous criteria were excluded from analyses. Model: We developed our weight model using values from STORK babies and then replicated it with values from the STARR babies. Height models were evaluated in STARR babies only because STORK data on height were scant. The Michaelis-Menten equation is described as follows: v = Vmax[S]/(Km + [S]), where v is the rate of product formation, Vmax is the maximum rate of the system, [S] is the substrate concentration, and Km is a constant based upon the enzyme's affinity for the particular substrate. For this study the equation became: P = a1(Age/(b1 + Age)) + c1, where P was the predicted value of weight (kg) or height (cm), Age was the age of the infant in days, and c1 was an additional constant over the original Michaelis-Menten equation that accounted for the infant's non-zero weight or length at birth. Each of the parameters a1, b1 and c1 was unique to each child and was calculated using the nonlinear least squares (nls) method. In our case, weight data were fitted to a model using the statistical language R, by calling nls() with the following parameters: fitted_model <- nls(weights~(c1+(a1*ages)/(b1+ages)), start = list(a1 = 5, b1 = 20, c1=2.5)), where weights and ages were vectors of each subject's weight in kg and age in days. The default Gauss-Newton algorithm was used.
The optimization objective is not convex in the parameters and can suffer from local optima and boundary conditions. In such cases good starting values are essential: the starting parameter values (a1=5, b1=20, c1=2.5) were adjusted manually using the STORK dataset to minimize model failures; these tended to occur when the parameter values, particularly a1 and b1, increased without bound during the iterative steps required to optimize the model. These same parameter values were used for the larger STARR dataset. The starting height parameter values for height modeling were higher than those for weight modeling, due to the different units involved (cm vs. kg) (a1=60, b1=530, c1=50). Because this was a non-linear model, goodness of fit was assessed primarily via root mean squared error (RMSE) for both weight and height. Imputation tests: To test for the influence of specific time points on the models, we limited our analysis to STARR babies with all recommended well-baby visits (12 over three years). Each scheduled visit except day 1 occurred in a time window around the expected well-baby visit (Visit1: Day 1, Visit2: days 20–44, Visit3: 46–90, Visit4: 95–148, Visit5: 158–225, Visit6: 250–298, Visit7: 310–399, Visit8: 410–490, Visit9: 500–600, Visit10: 640–800, Visit11: 842–982, Visit12: 1024–1125). We considered two different sets: infants with all scheduled visits in the first year of life (seven total visits) and those with all scheduled visits over the full three-year timeframe (12 total visits). We fit these two sets to the model, identifying baseline RMSE. Then, every visit, and every combination of two to five visits were dropped, so that the RMSE or model failures for a combination of visits could be compared to baseline. Prediction: We sought to predict weight or height at 36 months (Y3) from growth measures assessed only up to 12 months (Y1) or to 24 months (Y1+Y2), utilizing the “last value” approach. 
In brief, the last observation for each child (here, growth measures at 36 months) is used to assess overall model fit, by focusing on how accurately the model can extrapolate the measure at this time point. We identified all STARR infants with at least five time points in Y1 and at least two time points in both Y2 and Y3, with the selection of these time points based on maximizing the number of later time points within the constraints of the well-baby visit schedule for Y2 and Y3. The per-subject set of time points (Y1-Y3) was fitted using the modified Michaelis-Menten equation and the mean squared error was calculated, acting as the “baseline” error. The model was then run on the subset of Y1 only and of Y1+Y2 only. To test predictive accuracy of these subsets, the RMSE was calculated using the actual weights or heights versus the predicted weights or heights of the three time series. Comparison with other models: We examined how well the modified Michaelis-Menten equation performed interpolation in STARR babies compared to ten other commonly used interpolation methods and pediatric growth models including: (1) the ‘last observation carried forward’ model; (2) the linear model; (3) the robust linear model (RLM method, base R MASS package); (4) the Laird and Ware linear model (LWMOD method); (5) the generalized additive model (GAM method); (6) locally estimated scatterplot smoothing (LOESS method, base R stats package); (7) the smooth spline model (smooth.spline method, base R stats package); (8) the multilevel spline model (Wand method); (9) the SITAR (superimposition by translation and rotation) model and (10) fast covariance estimation (FACE method). Model fit used the holdout approach: a single datapoint (other than birth weight or birth length) was randomly removed from each subject, and the RMSE of the removed datapoint was calculated as the model fitted to the remaining data. 
The hbgd package was used to fit all models except the ‘last observation carried forward’ model, the linear model and the SITAR model. For the ‘last observation carried forward’ model, the holdout data point was interpolated by the last observation by converting the random holdout value to NA and then using the function na.locf() from the zoo R package. For the simple linear model, the holdout-filtered data were used to determine the slope and intercept via R’s lm() function, which were then used to calculate the holdout value. For the SITAR model, each subject was fitted by calling the sitar() function with df=2 to minimize failures, and the RMSE of the random holdout point was subsequently calculated with the predict() function. For this analysis, set.seed(1234) was used to initialize the pseudorandom generator.
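As an illustration of the modified Michaelis-Menten fit described above (the study uses R's nls() with a Gauss-Newton algorithm; this is a Python sketch using scipy's least-squares curve fitting, and the ages, weights, and true parameter values are hypothetical):

```python
import numpy as np
from scipy.optimize import curve_fit

def mm_growth(age, a1, b1, c1):
    # Modified Michaelis-Menten: P = a1 * Age / (b1 + Age) + c1,
    # where c1 accounts for the non-zero size at birth.
    return a1 * age / (b1 + age) + c1

# Hypothetical noiseless weights (kg) at well-baby visit ages (days),
# generated from known parameters so the fit can be checked.
ages = np.array([1.0, 60.0, 120.0, 240.0, 365.0, 730.0, 1095.0])
weights = mm_growth(ages, 14.0, 300.0, 3.2)

# Starting values mirror those reported for the weight model (a1=5, b1=20, c1=2.5).
params, _ = curve_fit(mm_growth, ages, weights, p0=[5.0, 20.0, 2.5])
rmse = float(np.sqrt(np.mean((mm_growth(ages, *params) - weights) ** 2)))
```

Once the per-child parameters are estimated, interpolating a missing visit is just evaluating mm_growth at the missing age, which is how the holdout RMSE comparisons above are scored.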

  10. Performance of the different kernel functions of SVR in the validation...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Mahmoud Abdel-Sattar; Abdulwahed M. Aboukarima; Bandar M. Alnahdi (2023). Performance of the different kernel functions of SVR in the validation dataset with the default quantities used in the Weka software. [Dataset]. http://doi.org/10.1371/journal.pone.0245228.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mahmoud Abdel-Sattar; Abdulwahed M. Aboukarima; Bandar M. Alnahdi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of the different kernel functions of SVR in the validation dataset with the default quantities used in the Weka software.

  11. EPJSOIL SERENA WP3 T3.3 : France climatic and land use change modelling...

    • zenodo.org
    • catalogue.ejpsoil.eu
    • +2more
    Updated Nov 7, 2024
    + more versions
    Cite
    Blandine Lemercier; Didier Michot; Christian Walter; Antoine Boutier; Arthur Gaillot; David Montagne; Christine Le Bas (2024). EPJSOIL SERENA WP3 T3.3 : France climatic and land use change modelling dataset (Saclay) [Dataset]. http://doi.org/10.5281/zenodo.14000972
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Blandine Lemercier; Didier Michot; Christian Walter; Antoine Boutier; Arthur Gaillot; David Montagne; Christine Le Bas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, Saclay
    Description

    The internal EJP SOIL project SERENA contributed to the evaluation of soil multifunctionality, aiming to provide assessment tools for land planning and soil policies at different scales. Working with relevant stakeholders, the project delivered co-developed indicators and associated cookbooks to assess and map them, reporting on soil degradation, soil-based ecosystem services and their bundles, under current conditions and under climate and land-use change, at regional, national, and European scales.

    The data aim to explore the effect of climate change under different climate scenarios up to 2050, and to understand the resistance of soils to climate change under different types of land use. Additionally, they serve to examine the relations among soil ecosystem services (SES) and the main factors driving SES variation. The data were produced by the SERENA WP3 T3.3 France team using the JAVA-STICS 10.0.0 model and the STICSonR package, with the input files specified in the dataset. Results consist of yearly SES values calculated from the daily STICS model output. Because of the file size of the daily STICS results, those files are not part of the dataset; the input data and R scripts used are provided instead.

    Climatic data were obtained from the SAFRAN data provided by Météo-France and were downloaded via the SICLIMA platform developed by AgroClim-INRAE. Plant and fertilizer data come from the default STICS datasets. All input data are available as part of this dataset.

  12. Proteomic signatures of human visceral and subcutaneous adipocytes -...

    • figshare.com
    txt
    Updated Oct 27, 2021
    Cite
    Pavel Hruška; Jan Kucera; Matej Pekar; Pavol Holeczy; Miloslav Mazur; Marek Buzga; Daniela Kuruczova; Peter Lenart; Jana Fialova Kucerova; David Potěšil; Zbyněk Zdráhal; Julie Bienertova-Vasku (2021). Proteomic signatures of human visceral and subcutaneous adipocytes - Supplementary files [Dataset]. http://doi.org/10.6084/m9.figshare.14626341.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    figshare
    Authors
    Pavel Hruška; Jan Kucera; Matej Pekar; Pavol Holeczy; Miloslav Mazur; Marek Buzga; Daniela Kuruczova; Peter Lenart; Jana Fialova Kucerova; David Potěšil; Zbyněk Zdráhal; Julie Bienertova-Vasku
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplemental Information

    S File 1 – Dataset with normalized and imputed intensity values
    MaxQuant search proteinGroups.txt dataset with the following modifications: a) removal of decoy hits and contaminant protein groups; b) exclusion of 4 male sample pairs and an outlying sample pair; c) log2 transformation of protein group intensities; d) LoessF normalization; e) missing-value imputation using the imp4p package; f) filtration of protein groups with fewer than 8 measured intensity values for VA or SA, and of protein groups identified by fewer than 2 peptides within the SA or VA group of samples. The filtered dataset with normalized and imputed intensities was used for the comparative analysis using the Limma R package.

    S File 2 – Limma differential expression analysis results
    Differential expression analysis using the LIMMA R package [28]. The linear model used to compare paired differences between SA and VA samples was adjusted for batch effect by adding batch number as a variable in the model. The correlation between sample pairs was included in the linear model using appropriate functions from the LIMMA package [29]. Subsequently, the results were adjusted for multiple hypothesis testing using the Benjamini and Hochberg procedure [30] implemented in the LIMMA package.

    S File 3a – SA Reactome over-representation pathway analysis
    The file was retrieved by submitting the list of UniProt accessions of all significantly upregulated SA proteins to the Reactome data analysis tool.

    S File 3b – VA Reactome over-representation pathway analysis
    The file was retrieved by submitting the list of UniProt accessions of all significantly upregulated VA proteins to the Reactome data analysis tool.

    S File 4a – Pathway enrichment analysis of all differentially expressed proteins
    Cytoscape ClueGO plugin Reactome pathways and reactions enrichment analysis results using all upregulated SA and VA proteins submitted as separate groups. The analysis was performed using default settings but showing only results with a p-value < 0.05.

    S File 4b – Pathway enrichment analysis of SA upregulated proteins
    Cytoscape ClueGO plugin Reactome pathways and reactions enrichment analysis results for SA upregulated proteins. The analysis was performed using default settings but showing only results with a p-value < 0.05.

    S File 4c – Pathway enrichment analysis of VA upregulated proteins
    Cytoscape ClueGO plugin Reactome pathways and reactions enrichment analysis results for VA upregulated proteins. The analysis was performed using default settings but showing only results with a p-value < 0.05.

    S File 5 – SignalP prediction of putative secreted proteins
    The output of the putative secreted protein analysis using the SignalP-5.0 server. This analysis was performed using the FASTA sequences of the most differentially expressed proteins with log2FC > 1, separately for SA and VA proteins.

    S File 6 – Dendrogram with modules
    Clustering dendrograms of the SA and VA proteins, respectively, with dissimilarity based on topological overlap, together with assigned module colours after the Dynamic Tree Cut and subsequent merging of highly similar modules (module eigengene correlation > 0.8). The colours were assigned independently for the SA and VA dendrograms.

    S File 7 – SA WGCNA and module-trait relationships results
    The table contains the module membership and gene significance, with the respective p-values, of the WGCNA and module-trait relationship analysis for the SA protein expression.

    S File 8 – VA WGCNA and module-trait relationships results
    The table contains the module membership and gene significance, with the respective p-values, of the WGCNA and module-trait relationship analysis for the VA protein expression.
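    The filtering and imputation steps described for S File 1 can be sketched as below. This Python/NumPy version is only illustrative: the intensity matrix is synthetic, the group sizes are assumed, the filter is read as "require at least 8 measured values in each of SA and VA", and a simple per-protein median imputation stands in for the imp4p procedure the authors actually used:

    ```python
    import numpy as np

    rng = np.random.default_rng(7)

    # Synthetic log2 intensity matrix: 200 protein groups x 24 samples
    # (first 12 columns SA, last 12 VA), with ~20% values missing (NaN).
    X = np.log2(rng.uniform(1e5, 1e8, size=(200, 24)))
    X[rng.random(X.shape) < 0.2] = np.nan

    sa, va = X[:, :12], X[:, 12:]

    # Drop protein groups with fewer than 8 measured intensities in SA or VA
    # (one reading of the stated filter; an assumption in this sketch).
    measured_sa = np.sum(~np.isnan(sa), axis=1)
    measured_va = np.sum(~np.isnan(va), axis=1)
    keep = (measured_sa >= 8) & (measured_va >= 8)
    Xf = X[keep]

    # Stand-in imputation: per-protein median (imp4p is more sophisticated).
    med = np.nanmedian(Xf, axis=1, keepdims=True)
    Xi = np.where(np.isnan(Xf), med, Xf)
    ```

    Filtering before imputation matters: it guarantees every retained protein group has enough observed values for the imputed entries to be anchored in real measurements.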

  13. The 25 ICIs datasets incorporated.

    • plos.figshare.com
    xlsx
    Updated May 8, 2024
    + more versions
    Cite
    Hong Yang; Ying Shi; Anqi Lin; Chang Qi; Zaoqu Liu; Quan Cheng; Kai Miao; Jian Zhang; Peng Luo (2024). The 25 ICIs datasets incorporated. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012024.s003
    Explore at:
    xlsx (available download formats)
    Dataset updated
    May 8, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Hong Yang; Ying Shi; Anqi Lin; Chang Qi; Zaoqu Liu; Quan Cheng; Kai Miao; Jian Zhang; Peng Luo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The activation levels of biologically significant gene sets are emerging tumor molecular markers and play an irreplaceable role in tumor research; however, web-based tools for prognostic analyses that use them as markers remain scarce. We developed PESSA, a web-based tool for survival analysis using gene set activation levels. All data analyses were implemented in R. Activation levels of Molecular Signatures Database (MSigDB) gene sets were assessed using the single-sample gene set enrichment analysis (ssGSEA) method, based on data from the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), the European Genome-phenome Archive (EGA), and supplementary tables of articles. PESSA performs median and optimal cut-off dichotomous grouping of the ssGSEA scores for each dataset, relying on the survival and survminer packages for survival analysis and visualisation. PESSA is an open-access web tool for visualizing the results of tumor prognostic analyses using gene set activation levels. A total of 238 datasets from GEO, TCGA, EGA, and supplementary tables of articles, covering 51 cancer types and 13 survival outcome types, together with 13,434 tumor-related gene sets obtained from MSigDB, are pre-grouped. Users can obtain results, including Kaplan–Meier analyses based on the median and optimal cut-off values with accompanying visualization plots, and Cox regression analyses of dichotomous and continuous variables, by selecting the gene set markers of interest. PESSA (https://smuonco.shinyapps.io/PESSA/ or http://robinl-lab.com/PESSA) is a large-scale web-based tumor survival analysis tool covering a large amount of data that creatively uses predefined gene set activation levels as molecular markers of tumors.
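    The dichotomise-then-compare workflow the description outlines can be sketched with a hand-rolled Kaplan-Meier estimator. PESSA itself relies on the R survival and survminer packages; this Python sketch only illustrates the median cut-off step, and the ssGSEA scores and survival data below are hypothetical:

    ```python
    import numpy as np

    def kaplan_meier(time, event):
        """Return event times and Kaplan-Meier survival estimates S(t)."""
        time = np.asarray(time, dtype=float)
        event = np.asarray(event, dtype=int)
        s, times, surv = 1.0, [], []
        for t in np.unique(time):
            at_risk = np.sum(time >= t)          # subjects still in the risk set
            d = np.sum((time == t) & (event == 1))  # deaths at this time
            if d > 0:
                s *= 1.0 - d / at_risk
                times.append(float(t))
                surv.append(s)
        return times, surv

    # Hypothetical ssGSEA scores and survival data for one gene set.
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.55, 0.2, 0.7, 0.9])
    time = np.array([5.0, 12.0, 9.0, 30.0, 25.0, 7.0, 28.0, 33.0])
    event = np.array([1, 1, 1, 0, 1, 1, 0, 1])  # 1 = event, 0 = censored

    # Median cut-off dichotomisation, as in PESSA's pre-grouping step.
    high = scores > np.median(scores)
    t_hi, s_hi = kaplan_meier(time[high], event[high])
    t_lo, s_lo = kaplan_meier(time[~high], event[~high])
    ```

    The optimal cut-off variant works the same way, except the threshold is chosen to maximise group separation rather than fixed at the median.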

  14. The significant correlation coefficient (r) and regression coefficient...

    • plos.figshare.com
    xls
    Updated May 5, 2025
    Cite
    Ahmed Alhidary; Pesigan Arturo; Ali Al-Waleedi; Ferima Coulibaly-Zerbo; Omar Faisal; Mahammad Al Mansour; Ali AL-Mudwahi; Mohammed Rajamanar; Ezechiel Bisalinkumi; Eshrak Al-Falahi; Latifah Ali (2025). The significant correlation coefficient (r) and regression coefficient determination for the variables related to the bed utilization rate, efficiency, and effectiveness. [Dataset]. http://doi.org/10.1371/journal.pone.0316583.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    May 5, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ahmed Alhidary; Pesigan Arturo; Ali Al-Waleedi; Ferima Coulibaly-Zerbo; Omar Faisal; Mahammad Al Mansour; Ali AL-Mudwahi; Mohammed Rajamanar; Ezechiel Bisalinkumi; Eshrak Al-Falahi; Latifah Ali
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The significant correlation coefficient (r) and regression coefficient determination for the variables related to the bed utilization rate, efficiency, and effectiveness.


Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2

Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews"

Related Article
Explore at:
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description

We provide instructions, code, and datasets for replicating the article by Kim, Lee, and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers and practitioners apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, and 4-b. Note: because of Yelp's dataset terms of use and data-size restrictions, we instead provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), variable-selection indicators, and segment memberships, you will obtain the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will obtain estimated coefficients (posterior means of coefficients), variable-selection indicators, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will obtain estimated coefficients (posterior means of coefficients), variable-selection indicators, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running Benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors, and conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- the 'model_hdp' function identifies topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: the default 'lm' function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a given number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable within each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a given number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable within each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
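The benchmark pipelines described above can be sketched in a few lines of R. This is only an illustrative sketch, not the authors' replication code: it assumes a document-term matrix named dtm and a data frame df with a stars column (both names are hypothetical), uses 10 topics and 3 segments as example settings, and relies only on the CRAN packages ('topicmodels', 'flexmix') and functions ('LDA', 'posterior', 'lm', 'flexmix') named in the guidelines.

```r
# Sketch of two benchmark pipelines, assuming a document-term matrix `dtm`
# and a data frame `df` with a `stars` rating column (illustrative names).
library(topicmodels)
library(flexmix)

## Unsupervised topic model + regression benchmark
lda_fit <- LDA(dtm, k = 10, control = list(seed = 123))  # fit LDA; k chosen e.g. via 'ldatuning'
topic_probs <- posterior(lda_fit)$topics                 # per-document topic probabilities
df_reg <- data.frame(stars = df$stars, topic_probs)      # topic shares as predictors
agg_fit <- lm(stars ~ ., data = df_reg)                  # aggregate regression benchmark

## Latent class regression without variable selection (flexmix benchmark)
lcr_fit <- flexmix(stars ~ ., data = df_reg, k = 3)      # 3 segments, as in the study
parameters(lcr_fit)                                      # segment-level coefficients
clusters(lcr_fit)                                        # hard segment memberships
```

Segment-level predictions of the dependent variable can then be formed by applying each segment's coefficients to the observations assigned to that segment, as the guidelines describe.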
