Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction • The Cirrhosis Prediction dataset is intended for the advancement of machine learning models to predict the stage of liver cirrhosis. It contains various clinical features, which are vital for prognosis and treatment strategies.
2) Data Utilization
(1) Cirrhosis Prediction data has the following characteristics:
• It includes clinical data such as liver biochemistry, demographic details, and histology grading.
• The dataset aids in developing predictive models for staging liver cirrhosis, potentially improving patient outcomes.
(2) Cirrhosis Prediction data can be used for:
• Medical research: developing algorithms for early detection and progression tracking of liver cirrhosis.
• Healthcare strategy: informing medical interventions and managing treatment plans for patients.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Challenge Details: In this data-driven hackathon, participants will develop machine learning models to predict the lean_body_mass based on Lean Body Mass Data.
Submission and Evaluation
Submission Format: Participants must submit their predictions in the format specified in submission.csv.
Evaluation Metric: Submissions will be evaluated based on the R2_Score, measuring how well the model predicts the lean_body_mass.
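The R2_Score metric can be computed directly from its definition; a minimal sketch (the function name and sample values below are illustrative, not part of the challenge):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# A perfect prediction scores 1.0; always predicting the mean scores 0.0.
print(r2_score([50.0, 60.0, 70.0], [50.0, 60.0, 70.0]))  # → 1.0
```

Scores below 0.0 are possible and indicate a model worse than the mean baseline.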
Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• A patient dataset aimed at developing and validating a prediction model for all-cause in-hospital mortality in hospitalized patients.
2) Data Utilization
(1) Patient data has the following characteristics:
• A CSV file of 85 columns covering variables such as age, BMI, and ethnicity; patient survival is predicted from these factors.
(2) Patient data can be used for:
• Personalized medicine: insights gained from the data can support personalized treatment plans that tailor interventions to individual patient needs based on predicted survival probabilities.
• Healthcare management: the data can help predict patient outcomes, plan for future healthcare needs, and improve overall patient care strategies.
Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction • The Orbit Classification For Prediction Dataset focuses on predicting the classes of orbits for celestial objects. This dataset includes parameters such as semi-major axis, eccentricity, inclination, argument of perihelion, and more. It provides a comprehensive overview for orbit classification and prediction.
2) Data Utilization
(1) Orbit data has the following characteristics:
• It allows for detailed analysis and classification of orbits based on several orbital parameters, aiding in the prediction and understanding of celestial objects' orbits.
(2) Orbit data can be used for:
• Astronomy and space research: classifying and predicting the orbits of celestial objects, aiding space exploration and study.
• Educational purposes: supporting academic studies in celestial mechanics and orbital dynamics.
• Technology development: supporting the development of algorithms and AI models for orbit prediction and classification.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
📣 Challenge Details: In this data-driven hackathon, participants will develop machine learning models to predict the BeatsPerMinute based on Music Track BPM Data.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background
Liver transplantation (LT) is one of the main curative treatments for hepatocellular carcinoma (HCC). The Milan criteria have long been applied to candidate LT patients with HCC. However, the Milan criteria fail to precisely identify patients at risk of recurrence. As a result, we aimed to establish and validate a deep learning model, compare it with the Milan criteria, and better guide post-LT treatment.
Methods
A total of 356 HCC patients who received LT and had complete follow-up data were evaluated. The entire cohort was randomly divided into a training set (n = 286) and a validation set (n = 70). A multi-layer perceptron model provided by the pycox library was first used to construct the recurrence prediction model. Then TabNet, a tabular neural network that combines elements of deep learning with tabular data processing techniques, was utilized to compare against the Milan criteria and verify the performance of the proposed model.
Results
Patients with tumor size over 7 cm, poorer differentiation of tumor grade, and multiple tumors were first classified as being at high risk of recurrence. We trained a classification model with TabNet, and our proposed model performed better than the Milan criteria in terms of accuracy (0.95 vs. 0.86, p < 0.05). In addition, our model showed better performance with improved AUC, NRI, and hazard ratio, proving the robustness of the model.
Conclusion
A prognostic model has been proposed based on the use of TabNet on various parameters from HCC patients. The model performed well in post-LT recurrence prediction and in the identification of high-risk subgroups.
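The study's multi-layer perceptron is built on the pycox library; as a rough illustration of the 286/70 train/validation design on 356 patients, here is a sketch using scikit-learn's MLPClassifier as a stand-in, with entirely synthetic values in place of the patient records:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic stand-ins for tumor size (cm), differentiation grade, tumor number
X = np.column_stack([rng.uniform(1.0, 12.0, 356),
                     rng.integers(1, 4, 356),
                     rng.integers(1, 4, 356)]).astype(float)
# Mimics the high-risk rule described above: size > 7 cm and multiple tumors
y = ((X[:, 0] > 7.0) & (X[:, 2] > 1)).astype(int)

# 286/70 train/validation split, matching the study design
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=286, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                                  random_state=0)).fit(X_tr, y_tr)
acc = clf.score(X_va, y_va)  # validation accuracy on the synthetic task
```

This is only a structural sketch; the published model uses pycox and TabNet on real clinical features.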
Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Campus Placement Prediction: Binary Classification dataset encapsulates a comprehensive array of attributes for predicting the outcome of candidate selection during campus placement.
2) Data Utilization
(1) Campus Placement Prediction: Binary Classification data has the following characteristics:
• The dataset includes various socioeconomic factors such as serial numbers, gender, secondary and higher education, university education, jobs, and employability.
(2) Campus Placement Prediction: Binary Classification data can be used for:
• Development of predictive models: building machine learning models that predict placement outcomes from a given candidate's attributes.
• Feature importance analysis: determining which candidate attributes have the greatest impact on placement results.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
📣 Challenge Details: Your goal is to build a machine learning model to predict fuel_efficiency_kmpl using used cars data.
Data Description The dataset for this hackathon includes:
train.csv: Contains used cars data.
test.csv: Contains data for testing.
submission.csv: The format in which your predictions should be submitted.
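A minimal sketch of the train.csv → submission.csv workflow (the `id` column, the feature names, and the linear model are assumptions; the real schema is whatever submission.csv specifies):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def make_submission(train: pd.DataFrame, test: pd.DataFrame,
                    target: str = "fuel_efficiency_kmpl") -> pd.DataFrame:
    """Fit on train.csv-style data and build a submission.csv-style frame."""
    features = [c for c in train.columns if c not in (target, "id")]
    model = LinearRegression().fit(train[features], train[target])
    return pd.DataFrame({"id": test["id"],
                         target: model.predict(test[features])})

# In the hackathon this would be:
# train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv")
# make_submission(train, test).to_csv("submission.csv", index=False)
```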
License: Database Contents License 1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Predict customer churn for a credit card company based on the given features. You can use machine learning as well as deep learning techniques to produce meaningful outputs. This dataset is very basic and can be used for building a basic understanding.
Please review Zhang et al. (2021) for details on study design and datasets (https://doi.org/10.1016/j.watres.2022.118443). In summary, predictor and response variable data were acquired from the Chesapeake Bay Program and USGS. These data were subjected to a trend analysis to estimate the MK linear slope change for both predictor and response variables. After running a cluster analysis on the scaled TN loading time series (the response variable), the cluster assignments were paired with the slope estimates from the suite of predictor variables tied to the nutrient inventory and static geologic and land use variables. From there, an RF analysis was executed to link trends in anthropogenic drivers and other contextual environmental factors to the identified trend cluster types. After calibrating the RF model, the likelihood of improving, relatively static, or degrading catchments across the Chesapeake Bay was identified for the 2007 to 2018 period. Tabular data are available on the journal website and PubMed, and the predictor/response variable data can be downloaded individually from the USGS and Chesapeake Bay Program links listed in the data access section. Portions of this dataset are inaccessible because the data were generated by other federal entities and are housed in their respective data warehouse domains (e.g., USGS and Chesapeake Bay Program). The combined dataset can be accessed on the journal website (https://www.sciencedirect.com/science/article/pii/S0043135422003979?via%3Dihub#ack0001) and on NCBI PubMed (https://pubmed.ncbi.nlm.nih.gov/35461100/).
The predictor variable data can be accessed from the Chesapeake Bay Program (https://cast.chesapeakebay.net/) and USGS (https://pubs.er.usgs.gov/publication/ds948 and https://www.sciencebase.gov/catalog/item/5669a79ee4b08895842a1d47). This dataset is associated with the following publication: Zhang, Q., J. Bostic, and R. Sabo. Regional patterns and drivers of total nitrogen trends in the Chesapeake Bay watershed: Insights from machine learning approaches and management implications. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 218: 1-15, (2022).
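The RF step described above can be sketched as a classifier that maps per-catchment slope estimates to trend-cluster likelihoods; scikit-learn's RandomForestClassifier substitutes here for the original analysis, and all values are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-ins: MK slope estimates for five predictor variables
# per catchment (200 synthetic catchments)
X = rng.normal(size=(200, 5))
# Trend-cluster labels: 0 = improving, 1 = relatively static, 2 = degrading
y = rng.integers(0, 3, size=200)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Per-catchment likelihood of each trend class, as in the calibrated RF
proba = rf.predict_proba(X[:3])
print(proba.shape)  # (3, 3): three catchments × three trend classes
```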
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Prediction Data of Base Models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.
The files of this figshare item include data that was collected for the paper:
Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.
The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.
In detail, the data contains the predictions of base models on the validation and test data, as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.
The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.
Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.
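As a toy illustration of what post hoc ensembling over stored base-model predictions looks like, here is a plain weighted average of predicted class probabilities (the array shapes are assumptions for illustration, not the metatask file format):

```python
import numpy as np

def ensemble_average(base_preds, weights=None):
    """Weighted average of base models' predicted class probabilities.

    base_preds: array-like of shape (n_models, n_samples, n_classes),
    e.g. validation predictions loaded from a metatask (shape assumed).
    """
    base_preds = np.asarray(base_preds, dtype=float)
    n_models = base_preds.shape[0]
    w = (np.full(n_models, 1.0 / n_models) if weights is None
         else np.asarray(weights, dtype=float))
    # Sum over the model axis, weighted -> (n_samples, n_classes)
    return np.tensordot(w, base_preds, axes=1)

# Two base models disagreeing on one sample; the ensemble splits the difference.
p = ensemble_average([[[0.9, 0.1]], [[0.5, 0.5]]])
print(p)  # [[0.7 0.3]]
```

Methods like Q(D)O-ES go further by selecting which models enter the ensemble and with what weights; this sketch only shows the aggregation step.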
Details
The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.
The link resolves to a directory containing the following:
example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
metatasks_bacc.zip: The Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.
The sizes after unzipping are:
metatasks_roc_auc.zip: ~450GB
metatasks_bacc.zip: ~330GB
We suggest extracting only files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.
The metatask .zip files contain 2 subdirectories for Metatasks produced based on TopN or SiloTopN pruning (see paper for details). In each of these subdirectories, 2 files per metatask exist: one .json file with metadata information and one .hdf or .csv file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metatasks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
A histogram-based boosted regression tree (HBRT) method was used to predict the depth to the surficial aquifer water table (in feet) throughout the State of Wisconsin. This method used a combination of discrete groundwater levels from the U.S. Geological Survey National Water Information System, continuous groundwater levels from the National Groundwater Monitoring Network, the State of Wisconsin well-construction database, and NHDPlus version 2.1-derived points. The water table depth was predicted using the HBRT model available through scikit-learn in Python version 3.10.10. The HBRT model can predict the surficial water table depth for any latitude and longitude in Wisconsin. A total of 48 predictor variables were used for model development, including basic well characteristics, soil properties, aquifer properties, hydrologic position on the landscape, recharge and evapotranspiration rates, and bedrock characteristics. Model results indicate that the mean surficial water table depth across Wisconsin is 28.3 feet below land surface, with a root mean square error of 7.40 feet for the holdout data of the HBRT model. Aside from the overall HBRT methods contained in the Python script, this data release includes a self-contained model directory for recreating the HBRT model published in this data release. The model directory also includes a model object for the HBRT model used to predict the surficial aquifer water table depth (in feet) for the State of Wisconsin. Three separate directories within this data release define the input predictor variables, water levels, and NHD points for the HBRT model. The 'bedrock-overlay' sub-directory contains geospatial data that define the special selection zones used in the depth-to-water well selection (DTW_well_selection_zones.docx).
The 'water-levels' sub-directory contains input files for the NHDPlus version 2.1 points, the State of Wisconsin well construction spreadsheets, and water level summary files. The 'python-attributes' sub-directory contains predictor variable rasters and vector data that predict the surficial water table depth for Wisconsin and a Jupyter Notebook used for the attribution and input files for well and NHD points.
Introduction
Machine learning (ML) is an effective tool for predicting mental states and is a key technology in digital psychiatry. This study aimed to develop ML algorithms to predict the upper tertile group of various anxiety symptoms based on multimodal data from virtual reality (VR) therapy sessions for social anxiety disorder (SAD) patients, and to evaluate their predictive performance across each data type.
Methods
This study included 32 SAD-diagnosed individuals and finalized a dataset of 132 samples from 25 participants. It utilized multimodal (physiological and acoustic) data from VR sessions simulating social anxiety scenarios. The study employed the extended Geneva minimalistic acoustic parameter set for acoustic feature extraction and extracted statistical attributes from time series-based physiological responses. We developed ML models that predict the upper tertile group for various anxiety symptoms in SAD using Random Forest, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost) models. The best parameters were explored through grid search or random search, and the models were validated using stratified cross-validation and leave-one-out cross-validation.
Results
The CatBoost model, using multimodal features, exhibited high performance, particularly for the Social Phobia Scale, with an area under the receiver operating characteristic curve (AUROC) of 0.852. It also showed strong performance in predicting cognitive symptoms, with the highest AUROC of 0.866 for the Post-Event Rumination Scale. For generalized anxiety, the LightGBM prediction for the State-Trait Anxiety Inventory-trait led to an AUROC of 0.819.
In the same analysis, models using only physiological features had AUROCs of 0.626, 0.744, and 0.671, whereas models using only acoustic features had AUROCs of 0.788, 0.823, and 0.754.
Conclusions
This study showed that an ML algorithm using integrated multimodal data can predict upper tertile anxiety symptoms in patients with SAD with higher performance than acoustic or physiological data alone obtained during a VR session. The results of this study can be used as evidence for personalized VR sessions and to demonstrate the strength of the clinical use of multimodal data.
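The stratified cross-validation loop described in the Methods can be sketched as follows, with scikit-learn's GradientBoostingClassifier standing in for CatBoost/LightGBM and synthetic features standing in for the multimodal data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
# Synthetic stand-ins for multimodal (acoustic + physiological) features
X = rng.normal(size=(120, 6))
# Synthetic upper-tertile label driven by the first feature
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

aucs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X, y):
    clf = GradientBoostingClassifier(random_state=0).fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
mean_auc = float(np.mean(aucs))  # cross-validated AUROC
```

Stratification keeps the class ratio stable across folds, which matters when the upper-tertile class is the minority.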
Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction • The Salary data aims to determine whether individuals earn less than or more than $50,000 annually based on their employment, education, and demographic information. It is used widely in analyses that seek to understand income disparities and economic factors influencing earnings.
2) Data Utilization
(1) Salary data has the following characteristics:
• The dataset includes factors such as age, education, job type, hours worked per week, and other socio-economic variables that contribute to predicting salary categories.
(2) Salary data can be used for:
• Workforce analysis: helping employers and policymakers understand wage structures and adjust compensation plans accordingly.
• Economic research: helping researchers analyze economic mobility and the impact of education and employment on income levels.
This dataset contains the predicted prices of the asset TABLE over the next 16 years. The data is calculated initially using a default 5 percent annual growth rate; after page load, a sliding scale component lets the user further adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
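The projection reduces to compound growth, p_t = p_0 · (1 + g/100)^t; a minimal sketch (the function name and two-decimal rounding are illustrative assumptions):

```python
def project_prices(price: float, years: int = 16, growth_pct: float = 5.0):
    """Compound the price annually: p_t = p_0 * (1 + g/100)**t.

    growth_pct is adjustable between -100 and 100, as on the page's slider.
    """
    if not -100.0 <= growth_pct <= 100.0:
        raise ValueError("growth rate must be between -100 and 100 percent")
    g = 1.0 + growth_pct / 100.0
    return [round(price * g ** t, 2) for t in range(1, years + 1)]

print(project_prices(100.0, years=3))  # → [105.0, 110.25, 115.76]
```

At -100 percent the price goes to zero after one year; at the default 5 percent, the 16-year price is roughly 2.18 times the starting price.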
A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 predictions of spatially and temporally continuous monthly streamflow. Additional information about the datasets is in the metadata included in the four zipped dataset files, and about the MLFLOW model in the readme included in the zipped model archive folder.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Additional file 1: Table 1 Reference List. Title of data: Journal articles included in the review, ordered alphabetically by first author surname. Description of data: A list of all articles included in the review, including author names, title of publication, journal of publication, volume, pages and DOI number.
This dataset contains the predicted prices of the asset Round Table over the next 16 years. The data is calculated initially using a default 5 percent annual growth rate; after page load, a sliding scale component lets the user further adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
Background
In the context of the rapidly aging global population, sarcopenic obesity (SO) in older adults is associated with significantly higher rates of disability and mortality. SO has become a serious and critical public health concern. This study aimed to develop and validate predictive models using machine learning (ML) to identify SO in patients.
Methods
Data from 386 participants collected at the Affiliated Hospital of Qingdao University were divided in an 8:2 ratio, with 80% used for training and 20% for testing. Univariate analysis was performed to identify the factors correlated with SO, and multivariate logistic regression analysis was performed to determine the independent factors influencing SO. A Shapley Additive exPlanations (SHAP) diagram was used to illustrate the importance of variables in the model. To develop a predictive model for SO, we used five models and applied internal five-fold cross-validation to determine the most suitable hyperparameters for each model.
Results
Among the 386 participants, 61 were diagnosed with sarcopenic obesity (15.8%). We identified four independent predictive factors, namely BMI, Barthel Index score, grip strength, and calf circumference. Notably, calf circumference plays an important role in assessing the risk of SO in older adults. The area under the curve (AUC) values on the test set for the random forest (RF), naive Bayes (NB), Light Gradient Boosting Machine (LightGBM), k-nearest neighbor (KNN), and eXtreme Gradient Boosting (XGBoost) models were 0.839, 0.815, 0.808, 0.794, and 0.798, respectively. Among these models, the RF model exhibited the best average performance in the training set, with an AUC value of 0.839.
Conclusion
We constructed a predictive model based on the results of the RF model, combining four clinical predictors (BMI, Barthel Index score, grip strength, and calf circumference) to reliably predict SO.
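The RF workflow above (five-fold cross-validation plus variable importance) can be sketched with scikit-learn; the four feature columns are hypothetical stand-ins for BMI, Barthel Index score, grip strength, and calf circumference, with synthetic values rather than study data, and `feature_importances_` substitutes for the SHAP analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
# Synthetic stand-ins for: BMI, Barthel Index, grip strength, calf circumference
X = rng.normal(size=(300, 4))
# Synthetic SO label driven by two of the features
y = (X[:, 3] - X[:, 2] + rng.normal(scale=0.8, size=300) > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
# Internal five-fold cross-validated AUC, as in the study's model selection
auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
# Impurity-based variable importance (a rough proxy for the SHAP diagram)
importances = rf.fit(X, y).feature_importances_
```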