This dataset was created by Balal H
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The primary objective of this project was to acquire historical shoreline information for the entire Northern Ireland coastline. A detailed understanding of the coast's shoreline position and geometry over annual to decadal time periods is essential for any management of the coast.

The historical shoreline analysis was based on all available Ordnance Survey maps and aerial imagery. The analysis looked at position and geometry over annual to decadal time periods, providing a dynamic picture of how the coastline has changed since the early 1800s. Once all datasets were collated, the data were interrogated using the ArcGIS package Digital Shoreline Analysis System (DSAS), a software package that enables a user to calculate rate-of-change statistics from multiple historical shoreline positions. Rate-of-change was calculated at 25 m intervals and displayed both statistically and spatially, allowing areas of retreat/accretion to be identified along any given stretch of coastline.

DSAS produces the following rate-of-change statistics:

Net Shoreline Movement (NSM) – the distance between the oldest and the youngest shorelines.

Shoreline Change Envelope (SCE) – a measure of the total change in shoreline movement, considering all available shoreline positions and reporting their distances without reference to their specific dates.

End Point Rate (EPR) – derived by dividing the distance of shoreline movement by the time elapsed between the oldest and the youngest shoreline positions.

Linear Regression Rate (LRR) – determines a rate-of-change statistic by fitting a least-squares regression to all shorelines at specific transects.

Weighted Linear Regression Rate (WLR) – calculates a weighted linear regression of shoreline change on each transect, taking shoreline uncertainty into account by giving more emphasis to shorelines with a smaller error.

The end product provided by Ulster University is an invaluable tool and digital asset that has helped to visualise shoreline change and assess approximate rates of historical change at any given coastal stretch on the Northern Ireland coast.
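As a rough illustration (not the actual DSAS implementation), the per-transect statistics listed above can be sketched in Python; the transect dates and positions below are hypothetical:

```python
# Hypothetical sketch of DSAS-style rate-of-change statistics for a single
# transect. Positions are signed distances (m) from a baseline; dates in years.
import numpy as np

def transect_stats(years, positions):
    years = np.asarray(years, dtype=float)
    pos = np.asarray(positions, dtype=float)
    order = np.argsort(years)
    years, pos = years[order], pos[order]
    nsm = pos[-1] - pos[0]              # Net Shoreline Movement
    sce = pos.max() - pos.min()         # Shoreline Change Envelope
    epr = nsm / (years[-1] - years[0])  # End Point Rate (m/yr)
    lrr = np.polyfit(years, pos, 1)[0]  # Linear Regression Rate (slope, m/yr)
    return {"NSM": nsm, "SCE": sce, "EPR": epr, "LRR": lrr}

# Illustrative retreating transect: 18 m of net landward movement since 1830
stats = transect_stats([1830, 1900, 1960, 2010], [0.0, -5.0, -12.0, -18.0])
```

Negative rates indicate retreat, positive rates accretion, matching the retreat/accretion mapping described above.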
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering under misspecified models. Several simulation illustrations are provided highlighting weaknesses of the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
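A minimal sketch of the clustering $R^2$ discussed in the note, computed as the proportion of total variance explained by the cluster means; the two-cluster toy data are an illustrative assumption:

```python
# Clustering R^2 = 1 - (within-cluster SS / total SS), i.e. the proportion of
# total variance explained by the cluster means. Toy data for illustration.
import numpy as np

def clustering_r2(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand = X.mean(axis=0)
    tss = ((X - grand) ** 2).sum()      # total sum of squares
    wss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in np.unique(labels))  # within-cluster sum of squares
    return 1.0 - wss / tss

X = np.array([[0.0], [1.0], [10.0], [11.0]])   # two well-separated clusters
labels = np.array([0, 0, 1, 1])
r2 = clustering_r2(X, labels)
```

Stretching one coordinate along the direction separating the cluster means increases the between-cluster share of variance, which is the inflation mechanism the note warns about.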
This dataset contains 2 Files and 2 Folders
Column 1 contains text data. The text is preprocessed and balanced: the data contains an equal number of non-toxic (toxicity = 0) and toxic (toxicity > 0) comments.
Column 2 contains float data. This column stores the toxicity score of the text data.
Column 1 contains text data. In this version of the file, we implemented some additional pre-processing techniques, such as spelling correction. This dataset is also balanced: it contains an equal number of non-toxic (toxicity = 0) and toxic (toxicity > 0) comments.
Column 2 contains float data. This column stores the toxicity score of the text data.
All the FastText word embeddings in this dataset were learned using Python's gensim library with window size = 4 and sg = 0, which selects the Continuous Bag of Words (CBOW) approach to learning word embeddings.
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which it appears. Consider the example sentence “the quick brown fox jumps over the lazy dog”: if we take the word “jumps” as the center word, its context is formed by the words in its vicinity. With a context size of 2, the context is given by brown, fox, over, the. CBOW uses these context words to predict the target word, jumps.
If you are interested, you can learn more about FastText from the resources attached below:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83, 3.52).
Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
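As an illustrative sketch (not the authors' code), the modeling pipeline described, RF regression tuned with 3-fold cross-validation plus a bootstrapped confidence interval for the error, might look like this on synthetic stand-in data:

```python
# Sketch of the described pipeline: Random Forest regression with 3-fold CV
# hyperparameter search and a percentile-bootstrap CI for the MAE.
# Synthetic data stands in for the 50-patient clinical dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # stand-ins for predictors such as Hb, CRP, ESR, age
y = X @ np.array([1.5, -1.0, 0.5, 0.8]) + rng.normal(scale=0.5, size=50)

# 3-fold CV over a small illustrative hyperparameter grid
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"n_estimators": [100], "max_depth": [3, None]}, cv=3)
search.fit(X, y)
pred = search.predict(X)

# Percentile bootstrap CI for the MAE (200 resamples)
maes = [mean_absolute_error(y[idx], pred[idx])
        for idx in (rng.integers(0, 50, 50) for _ in range(200))]
ci = (np.percentile(maes, 2.5), np.percentile(maes, 97.5))
```

The bootstrap resamples (prediction, truth) pairs, so the interval reflects uncertainty from the small sample size, which is the stated motivation for reporting CIs here.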
Binary variables are reported as proportions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset and code used in a journal paper entitled "Geographically weighted regression based on a network weight matrix: a case study using urbanization driving force data in China", published in the International Journal of Geographical Information Science.
Abstract: Geographically weighted regression (GWR) is a classical modeling method for dealing with spatial non-stationarity. It incorporates the distance decay effect in space to fit local regression models, where distance is defined as Euclidean distance. Although this definition has been expanded, it remains focused on physical distance. However, in the era of globalization and informatization, where the phenomenon of remotely close association is common, physical distance may not reflect real spatial proximity, and GWR based on physical distance has clear limitations. This paper proposes a geographically weighted regression based on a network weight matrix (NWM GWR) model. This does not rely on geographical location modeling; instead, it uses network distance to measure the proximity between two regions and weights observations by improving the kernel function to achieve distance attenuation. We adopt the population mobility network to establish a network weight matrix, modeling China’s urbanization and its multidimensional driving factors using network autocorrelation and NWM GWR methods. Results show that the NWM GWR model has more accurate fit and better stability than ordinary least squares and GWR models, and better reveals relationships between variables, which makes it suitable for modeling economic and social systems more broadly.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset split into three files. The data were collected from 527 organic food consumers. The files for factor analysis and logistic regression contain all 527 respondents' data, while the file for clustering contains only the 401 respondents who consume organic food products, since clustering was carried out only for those consumers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses nearly all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating new features perspective: clustering analysis creates labels based on the patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited to it, it might increase the overall classification performance.

For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that applying clustering in the data preprocessing left our results not much better than random. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
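A minimal sketch of the approach discussed, using k-means cluster assignments as an extra feature before classification; the synthetic data, cluster count, and classifier are illustrative assumptions (a fixed random_state is used here for reproducibility, unlike the stability experiment described above):

```python
# Compare cross-validated accuracy with and without a k-means cluster label
# added as a feature. Data, k=3, and the classifier are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Baseline: classify on the raw features
base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Augmented: append the cluster assignment as one extra feature
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])
aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()
```

Re-running without a fixed random_state and watching `labels` (and `aug`) shift between runs reproduces the instability check described in the text.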
Democracy Timeseries Data Release 3.0, January 2009
This dataset is in a country-year case format, suitable for time-series analysis. It contains data on the social, economic, and political characteristics of 191 nations, with over 600 variables from 1971 to 2007. It merges the indicators of democracy by Freedom House, Vanhanen, Polity IV, and Cheibub and Gandhi, plus selected institutional classifications and socio-economic indicators from the World Bank. New variables include the KOF Globalization Index and the new Norris-Inglehart Cosmopolitan Index. Note that you should check the original codebooks for the meaning and definition of each variable; the period covered also varies by series. Note that the Excel version is for Office 2007 only. This is the dataset used in the book Driving Democracy.
January 2009
Stored in Stata, SPSS, Excel and CSV.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data includes all datasets and code for adversarial validation in geospatial machine learning prediction and the corresponding experiments. Alongside the datasets (the Brazil Amazon basin AGB dataset and the synthetic species abundance dataset) and code, Readme.txt explains each file's meaning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset collects a raw dataset and a processed dataset derived from the raw dataset. There is a document containing the analytical code for statistical analysis of the processed dataset in .Rmd format and .html format.
The study examined some aspects of mechanical performance of solid wood composites. We were interested in certain properties of solid wood composites made using different adhesives with different grain orientations at the bondline, then treated at different temperatures prior to testing.
Performance was tested by assessing fracture energy and critical fracture energy, lap shear strength, and compression strength of the composites. This document concerns only the fracture properties, which are the focus of the related paper.
Notes:
* the raw data is provided in this upload, but the processing is not addressed here.
* the authors of this document are a subset of the authors of the related paper.
* this document and the related data files were uploaded at the time of submission for review. An update providing the doi of the related paper will be provided when it is available.
We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for researchers or practitioners to apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to Yelp's dataset terms of use and data size restrictions, we instead provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get the dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R. After determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors, and conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package; 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package; 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, conduct prediction of the dependent variable for each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study), then predict the dependent variable for each segment using the estimated coefficients and memberships. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
The U.S. Geological Survey (USGS), in cooperation with the Federal Emergency Management Agency, Pennsylvania Department of Environmental Protection, Pennsylvania Department of Transportation, and Susquehanna River Basin Commission, prepared hydro-conditioned geographic information systems (GIS) layers for use in the Pennsylvania StreamStats application. These data were used to update the peak flow and low flow regression equations for Pennsylvania. This dataset consists of stream definition 900 cell threshold rasters for each 8-digit Hydrologic Unit Code (HUC) area in Pennsylvania, one of the layer types needed to delineate watersheds within the HUC-8 areas, merged into a single dataset. The 59 HUCs represented by this dataset are 02040101, 02040102, 02040103, 02040104, 02040105, 02040106, 02040201, 02040202, 02040203, 02040205, 02050101, 02050102, 02050103, 02050104, 02050105, 02050106, 02050107, 02050201, 02050202, 02050203, 02050204, 02050205, 02050206, 02050301, 02050302, 02050303, 02050304, 02050305, 02050306, 02060002, 02060003, 02070002, 02070003, 02070004, 02070009, 04110003, 04120101, 04130002, 05010001, 05010002, 05010003, 05010004, 05010005, 05010006, 05010007, 05010008, 05010009, 05020001, 05020002, 05020003, 05020004, 05020005, 05020006, 05030101, 05030102, 05030103, 05030104, 05030105, and 05030106.
This dataset supports the publication "Statistical simulation of ocean current patterns using autoregressive logistic regression (ALR) models: A case study in the Gulf of Mexico" (https://doi.org/10.1016/j.ocemod.2019.02.010). This dataset includes historical dynamic topography and surface currents from satellite altimetry (1991-01-01 to 2017-08-29), as well as processed views of the Gulf of Mexico Loop Current. The processed data include a time history of 84 empirical orthogonal functions (EOFs), the results of automated pattern identification (principal component analysis and k-means clustering), and the results of three (3) autoregressive logistic regression (ALR) models.
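A hedged sketch of the automated pattern identification step described (principal component analysis followed by k-means), run on a synthetic field rather than the altimetry data; the dimensions and cluster count are assumptions:

```python
# Pattern identification sketch: decompose a space-time field with PCA
# (an EOF-style decomposition), then cluster the principal-component time
# series with k-means. Synthetic data; shapes and k=4 are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
field = rng.normal(size=(500, 84))       # time steps x spatial points

pca = PCA(n_components=10)               # retain the 10 leading modes
pcs = pca.fit_transform(field)           # PC time series (one row per time step)

# Each time step is assigned to one of k recurring circulation patterns
patterns = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pcs)
```

Clustering in PC space rather than on the raw field keeps the distance computation in a low-dimensional, variance-ordered basis, which is the usual motivation for combining the two steps.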
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: This study aims to develop and compare different models to predict the Length of Stay (LoS) and the Prolonged Length of Stay (PLoS) of inpatients admitted through the emergency department (ED) in general patient settings. The aim is not to promote any specific model but rather to suggest a decision-supporting tool (i.e., a prediction framework).

Methods: We analyzed a dataset of patients admitted through the ED to the Sant'Orsola-Malpighi University Hospital of Bologna, Italy, between January 1 and October 26, 2022. PLoS was defined as any hospitalization with LoS longer than 6 days. We deployed six classification algorithms for predicting PLoS: Random Forest (RF), Support Vector Machines (SVM), Gradient Boosting (GB), AdaBoost, K-Nearest Neighbors (KNN), and logistic regression (LoR). We evaluated the performance of these models with the Brier score, the area under the ROC curve (AUC), accuracy, sensitivity (recall), specificity, precision, and F1-score. We further developed eight regression models for LoS prediction: Linear Regression (LR), including the penalized linear models Least Absolute Shrinkage and Selection Operator (LASSO), Ridge, and Elastic-net regression, as well as Support Vector Regression, RF regression, KNN, and eXtreme Gradient Boosting (XGBoost) regression. Model performance was measured by mean square error, mean absolute error, and mean relative error. The dataset was randomly split into a training set (70%) and a validation set (30%).

Results: A total of 12,858 eligible patients were included in our study, of whom 60.88% had a PLoS. The GB classifier best predicted PLoS (accuracy 75%, AUC 75.4%, Brier score 0.181), followed by the LoR classifier (accuracy 75%, AUC 75.2%, Brier score 0.182). These models were also adequately calibrated. Ridge and XGBoost regressions best predicted LoS, with the smallest total prediction error. The overall prediction error is between 6 and 7 days, meaning there is a 6-7 day mean difference between actual and predicted LoS.

Conclusion: Our results demonstrate the potential of machine learning-based methods to predict LoS and provide valuable insights into the risks behind prolonged hospitalizations. In addition to physicians' clinical expertise, the results of these models can be used as input to make informed decisions, such as predicting hospitalizations and enhancing the overall performance of a public healthcare system.
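As an illustrative sketch on synthetic data (not the hospital records), the classification evaluation described, a 70/30 split scored with the Brier score and AUC, might look like:

```python
# Evaluate two of the listed classifiers (gradient boosting and logistic
# regression) with a 70/30 split, Brier score, and AUC. Synthetic stand-in
# data; hyperparameters are illustrative defaults.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for model in (GradientBoostingClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    results[type(model).__name__] = {
        "brier": brier_score_loss(y_te, prob),  # lower is better
        "auc": roc_auc_score(y_te, prob),       # higher is better
    }
```

The Brier score rewards calibrated probabilities while AUC measures ranking quality, which is why the study reports both alongside threshold-based metrics.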
Logistic regression analysis to demonstrate variables associated with presence of psychiatric disorder in each of the six conditions (blank means that the odds ratio was not significant).
The U.S. Geological Survey (USGS), in cooperation with the Puerto Rico Environmental Quality Board, has compiled a series of geospatial datasets for Puerto Rico to be implemented into the USGS StreamStats application (https://streamstats.usgs.gov/ss/). These geospatial datasets, along with basin characteristics datasets for Puerto Rico published as a separate USGS data release (https://doi.org/10.5066/P9HK9SSQ), were used to delineate watersheds and develop the peak-flow and low-flow regression equations used by StreamStats. The geospatial datasets described herein are the stream definition rasters with a 900 stream cell threshold at a 10-m resolution. The flow accumulation grid is used as input to create this dense stream grid, which requires a flow accumulation of 900 pixels or greater to initiate a stream channel. A value of 1 is assigned to all cells equal to or greater than the threshold, and NoData to all other cells. Data are partitioned into four TIFF files, one for each of the four 8-digit Hydrologic Unit Code (HUC) areas for Puerto Rico: 21010002, 21010003, 21010004, and 21010005.
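The thresholding rule described can be sketched with numpy; the tiny array stands in for a real flow-accumulation raster, and the NoData value is an assumption:

```python
# Stream definition raster: cells whose flow accumulation meets the 900-cell
# threshold become 1; all other cells become NoData. The 3x3 array and the
# NoData sentinel value are illustrative assumptions.
import numpy as np

fac = np.array([[  10, 950,  20],
                [ 900, 899,   5],
                [1200,   0, 901]])     # flow accumulation (cell counts)

NODATA = -9999                         # hypothetical NoData sentinel
str900 = np.where(fac >= 900, 1, NODATA)
```

On real rasters the same element-wise rule is applied per pixel; only the array source (a TIFF read) and the NoData convention differ.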
The U.S. Geological Survey (USGS), in cooperation with the Illinois Center for Transportation and the Illinois Department of Transportation, prepared hydro-conditioned geographic information systems (GIS) layers for use in the Illinois StreamStats application. These data were used to delineate drainage basins and compute basin characteristics for updated peak flow and flow duration regression equations for Illinois. This dataset consists of raster grid files for elevation (dem), flow accumulation (fac), flow direction (fdr), and stream definition (str900) for each 8-digit Hydrologic Unit Code (HUC) area in Illinois merged into a single dataset. There are 51 full or partial HUC 8s represented by this data set: 04040002, 05120108, 05120109, 05120111, 05120112, 05120113, 05120114, 05120115, 05140202, 05140203, 05140204, 05140206, 07060005, 07080101, 07080104, 07090001, 07090002, 07090003, 07090004, 07090005, 07090006, 07090007, 07110001, 07110004, 07110009, 07120001, 07120002, 07120004 (0712003 was combined into this HUC), 07120005, 07120006, 07120007, 07130001, 07130002, 07130003, 07130004, 07130005, 07130006, 07130007, 07130008, 07130009, 07130010, 07130011, 07130012, 07140101, 07140105, 07140106, 07140108, 07140201, 07140202, 07140203, and 07140204.