This dataset contains model outputs that were analyzed to produce the main results of the paper.
Groundwater is a vital resource in the Mississippi embayment of the central United States. An innovative approach using machine learning (ML) was employed to predict groundwater salinity, including specific conductance (SC), total dissolved solids (TDS), and chloride (Cl) concentrations, across three drinking-water aquifers of the Mississippi embayment. An ML approach was used because it accommodates a large and diverse set of explanatory variables, does not assume monotonic relations between predictors and response data, and produces results that can be extrapolated to areas of the aquifer not sampled. These aspects of ML allowed potential drivers and sources of high-salinity water that have been hypothesized in other studies to be included as explanatory variables. The ML approach integrated output from a groundwater-flow model and water-quality data to predict salinity, and the approach can be applied to other aquifers to provide context for the long-term availability of groundwater resources. The Mississippi embayment includes two principal regional aquifer systems: the surficial aquifer system, dominated by the Quaternary Mississippi River Valley Alluvial aquifer (MRVA), and the Mississippi embayment aquifer system, which includes deeper Tertiary aquifers and confining units. Based on the distribution of groundwater use for drinking water, the modeling focused on the MRVA, middle Claiborne aquifer (MCAQ), and lower Claiborne aquifer (LCAQ). Boosted regression tree (BRT) models (Elith and others, 2008; Kuhn and Johnson, 2013) were developed to predict SC and Cl to 1-kilometer (km) raster grid cells of the National Hydrologic Grid (Clark and others, 2018) for 7 aquifer layers (1 MRVA, 4 MCAQ, 2 LCAQ) following the hydrogeologic framework of Hart and others (2008). TDS maps were created using the correlation between SC and TDS.
Explanatory variables for the BRT models included attributes associated with well location and construction, surficial variables (such as soils and land use), and variables extracted from a MODFLOW groundwater flow model for the Mississippi embayment (Haugh and others, 2020a; Haugh and others, 2020b). Prediction intervals were calculated for SC and Cl by bootstrapping raster-cell predictions following methods from Ransom and others (2017). For a full description of modeling workflow and final model selection see Knierim and others (2020).
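As a rough illustration of the two ideas above (deriving TDS from SC through their relation, and bootstrapping to get an interval around a prediction), the following sketch fits a straight line to paired SC and TDS values and refits it on resampled data. It is not the workflow of Knierim and others (2020): the linear model stands in for the BRT models, the sample values are hypothetical, and the resulting interval reflects only uncertainty in the fit, not residual scatter.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_interval(xs, ys, x_new, n_boot=500, alpha=0.10, seed=0):
    """Percentile interval for the fitted value at x_new from bootstrap refits."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:  # a constant resample cannot define a slope
            continue
        by = [ys[i] for i in idx]
        slope, intercept = fit_line(bx, by)
        preds.append(slope * x_new + intercept)
    preds.sort()
    lower = preds[int((alpha / 2) * (n_boot - 1))]
    upper = preds[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

# Hypothetical paired samples: SC in microsiemens/cm, TDS in mg/L.
sc = [210.0, 340.0, 480.0, 650.0, 820.0, 1100.0, 1500.0, 2100.0]
tds = [140.0, 215.0, 300.0, 420.0, 540.0, 700.0, 980.0, 1350.0]

slope, intercept = fit_line(sc, tds)
low, high = bootstrap_interval(sc, tds, x_new=1000.0)
print(f"TDS at SC=1000: {slope * 1000.0 + intercept:.0f} mg/L "
      f"(90% bootstrap interval {low:.0f} to {high:.0f})")
```

In a raster setting, the same resample-refit-predict loop would be repeated for every grid cell, and the percentiles taken cell by cell.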
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on April 29, 2025. Electroencephalogram (EEG) data recorded from invasive and scalp electrodes. The EEG database contains invasive EEG recordings of 21 patients suffering from medically intractable focal epilepsy. The data were recorded during invasive pre-surgical epilepsy monitoring at the Epilepsy Center of the University Hospital of Freiburg, Germany. In eleven patients, the epileptic focus was located in neocortical brain structures, in eight patients in the hippocampus, and in two patients in both. To obtain a high signal-to-noise ratio and fewer artifacts, and to record directly from focal areas, intracranial grid, strip, and depth electrodes were used. The EEG data were acquired using a Neurofile NT digital video EEG system with 128 channels, a 256 Hz sampling rate, and a 16-bit analogue-to-digital converter. No notch or band-pass filters were applied. For each patient, there are datasets called ictal and interictal, the former containing files with epileptic seizures and at least 50 min of pre-ictal data, the latter containing approximately 24 hours of EEG recordings without seizure activity. At least 24 h of continuous interictal recordings are available for 13 patients. For the remaining patients, interictal invasive EEG segments shorter than 24 h were joined together to give at least 24 h per patient. The database is an interdisciplinary project between the Epilepsy Center, University Hospital Freiburg; the Bernstein Center for Computational Neuroscience (BCCN), Freiburg; and the Freiburg Center for Data Analysis and Modeling (FDM).
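The acquisition parameters above translate directly into sample counts and raw data volume. The short calculation below is a sketch under stated assumptions: it treats every sample as 2 bytes (16 bits), assumes uncompressed storage, and uses the full 128-channel capacity of the system, although an individual recording may have used fewer channels.

```python
SAMPLING_RATE_HZ = 256  # from the description
CHANNELS = 128          # system capacity; a given recording may use fewer
BYTES_PER_SAMPLE = 2    # 16-bit converter, assuming no packing or compression

def samples_per_channel(duration_s, rate_hz=SAMPLING_RATE_HZ):
    """Number of samples one channel produces over duration_s seconds."""
    return int(duration_s * rate_hz)

def raw_size_bytes(duration_s, channels=CHANNELS):
    """Uncompressed size of a recording across all channels."""
    return samples_per_channel(duration_s) * channels * BYTES_PER_SAMPLE

preictal_samples = samples_per_channel(50 * 60)   # 50 min of pre-ictal data
interictal_bytes = raw_size_bytes(24 * 3600)      # 24 h of interictal data

print(preictal_samples)                     # 768000
print(round(interictal_bytes / 2**30, 2))   # 5.27 (GiB)
```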
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
jfo150/stems-predict-data dataset hosted on Hugging Face and contributed by the HF Datasets community
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Multiple modeling frameworks were used to predict daily temperatures at 0.5 m depth intervals for a set of diverse lakes in the U.S. states of Minnesota and Wisconsin. General Lake Model version 2 process-based (PB) models were configured and calibrated with training data to reduce root-mean-square error for 449 lakes (PBALL). Uncalibrated models (PB0; see Winslow et al. 2016 for details) used default configurations for 7,150 lakes, with no parameters adjusted according to model fit with observations.
This dataset contains the predicted prices of the asset predict over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
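A projection of this kind can be sketched in a few lines. The example assumes annual compounding, which the description does not state explicitly, and uses a hypothetical starting price of 100; the rate is clamped to the stated range of -100 to 100 percent.

```python
def project_prices(start_price, annual_rate=0.05, years=16):
    """Yearly prices under compound growth; index 0 is the starting price."""
    rate = max(-1.0, min(1.0, annual_rate))  # clamp to [-100%, +100%]
    return [start_price * (1.0 + rate) ** year for year in range(years + 1)]

prices = project_prices(100.0)  # hypothetical starting price
print(f"Year 16 at 5% growth: {prices[-1]:.2f}")
```

Moving the slider corresponds to calling `project_prices` again with a different `annual_rate`.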
Patient database that contains EEG data sets, executable tasks, and computational tools. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
This dataset contains the predicted prices of the asset Predict Crypto over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
This dataset was created by Luigi Perotti Souza
Multiple modeling frameworks were used to predict daily temperatures at 0.5 m depth intervals for a set of diverse lakes in the U.S. states of Minnesota and Wisconsin. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean-square error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations.
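The idea of adding energy conservation as a loss term can be illustrated with a small sketch. The form below (a supervised error plus a weighted penalty on the mismatch between the change in lake energy and the net energy flux) is an assumption made for illustration; the function names, the weight, and the tolerance are hypothetical and are not taken from the published models.

```python
def mse(predicted, observed):
    """Mean squared error between predicted and observed temperatures."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

def energy_penalty(energy, net_flux, tolerance=0.0):
    """Mean of max(0, |dE - flux| - tolerance) over consecutive time steps.

    energy[t] is the modeled lake energy at step t; net_flux[t] is the net
    energy gained between step t and step t + 1 (same units as energy).
    """
    terms = []
    for t in range(1, len(energy)):
        imbalance = abs((energy[t] - energy[t - 1]) - net_flux[t - 1])
        terms.append(max(0.0, imbalance - tolerance))
    return sum(terms) / len(terms)

def pgdl_loss(predicted, observed, energy, net_flux, weight=0.1, tolerance=0.0):
    """Supervised loss plus a weighted energy-conservation penalty."""
    return mse(predicted, observed) + weight * energy_penalty(energy, net_flux, tolerance)

# Hypothetical values for two temperature predictions and three energy states.
loss = pgdl_loss([10.0, 11.0], [10.0, 12.0], [100.0, 105.0, 107.0], [5.0, 4.0])
print(round(loss, 3))
```

A penalty of this shape is zero when the modeled energy change matches the fluxes to within the tolerance, so it only pushes the network away from physically inconsistent predictions.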
This metadata record describes outputs from 12 configurations of long short-term memory (LSTM) models which were used to predict streamflow drought occurrence at 384 stream gage locations in the Colorado River Basin region. The models were trained on data from 01-Oct-1981 to 31-Mar-2005 and validated over the period of record spanning 01-Apr-2005 to 31-Mar-2014. The models use explanatory variable inputs described in Wieczorek (2023) (doi.org/10.5066/P98IG8LO) to predict daily streamflow and streamflow percentiles as described in Simeone (2022) (doi.org/10.5066/P92FAASD). Separate models were trained to predict daily streamflow and streamflow percentiles. Two types of percentiles were modeled: (1) fixed-threshold percentiles that are based on comparing all streamflow throughout the year, and (2) variable-threshold percentiles that compare streamflow separately for each day of the year (using a moving 30-day window). Separate models were trained for predicting at lead times of 0, 7, and 14 days ahead. Details on methods and model configurations can be found in Hamshaw and others (2023). The comma-separated files are grouped by target variable and lead time as listed in the table below and include model output for the validation period (01-Apr-2005 to 31-Mar-2014). This metadata record also includes model code (see Readme.txt within the CRB_NN_model_archive.zip for more details) and a model performance metrics file (model_validation_performance_metrics_by_gage.csv).
| Data File | Prediction target variable | Forecast lead time | Model Configurations |
|---|---|---|---|
| streamflow_model_predictions_0day_ahead.csv | Daily Streamflow (mm/day) | 0 days | Streamflow-0d, PUB-Streamflow-0d |
| streamflow_model_predictions_7day_ahead.csv | Daily Streamflow (mm/day) | 7 days | Streamflow-7d |
| streamflow_model_predictions_14day_ahead.csv | Daily Streamflow (mm/day) | 14 days | Streamflow-14d |
| percentile_fixed_model_predictions_0day_ahead.csv | Fixed Percentile | 0 days | Fixed-0d, PUB-Fixed-0d, Q-to-Fixed-0d |
| percentile_fixed_model_predictions_7day_ahead.csv | Fixed Percentile | 7 days | Fixed-7d |
| percentile_fixed_model_predictions_14day_ahead.csv | Fixed Percentile | 14 days | Fixed-14d |
| percentile_variable_model_predictions_0day_ahead.csv | Variable Percentile | 0 days | Variable-0d, PUB-Variable-0d, Q-to-Variable-0d |
| percentile_variable_model_predictions_7day_ahead.csv | Variable Percentile | 7 days | Variable-7d |
| percentile_variable_model_predictions_14day_ahead.csv | Variable Percentile | 14 days | Variable-14d |
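The difference between the fixed-threshold and variable-threshold percentiles described above can be sketched as follows. This is an illustration under assumptions, not the method of Simeone (2022): it uses a simple empirical percentile (share of reference values less than or equal to the flow), a 365-day circular calendar, and a window of 15 days either side of the target day as a stand-in for the moving 30-day window. The record layout `(day_of_year, flow)` is hypothetical.

```python
def empirical_percentile(value, reference):
    """Percent of reference values that are less than or equal to value."""
    return 100.0 * sum(1 for r in reference if r <= value) / len(reference)

def fixed_percentile(flow, records):
    """Compare a flow against all flows in the record, regardless of season."""
    return empirical_percentile(flow, [q for _, q in records])

def variable_percentile(day_of_year, flow, records, half_window=15):
    """Compare a flow only against flows from nearby days of the year.

    Assumes at least one record falls inside the window.
    """
    def circular_gap(a, b):
        gap = abs(a - b) % 365
        return min(gap, 365 - gap)  # wrap around the end of the year

    window = [q for d, q in records if circular_gap(d, day_of_year) <= half_window]
    return empirical_percentile(flow, window)

# Hypothetical (day_of_year, flow in mm/day) records: low winter flows, high summer flows.
records = [(10, 1.0), (12, 2.0), (200, 10.0), (205, 20.0)]
print(fixed_percentile(2.0, records))         # 50.0
print(variable_percentile(11, 2.0, records))  # 100.0
```

In this toy record, a flow of 2.0 is unremarkable against the whole year but high for its season, which is the distinction the two percentile types are meant to capture.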
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Gene-gene chromatin interactions (GGIs) bring distal genes into close spatial proximity to permit strong co-expression, which could potentially contribute to cancer progression. High-throughput methods like Hi-C are impractical for very large cohort analyses, so we developed AI4Loop, a deep-learning-based artificial intelligence (AI) tool to predict GGIs using RNA-Seq data. Applying AI4Loop to 12,000 patient samples from the TCGA database across 32 cancer types revealed that GGIs show increased cancer sub-type predictivity compared to RNA-Seq data and demonstrated oncogenic gains of GGIs in almost all cancers examined. To target the therapeutic vulnerability of gain of GGIs in cancers, using low-information RNA expression datasets from the CLUE database, we also constructed a drug-perturbation GGI atlas from 50,000 drug-treated samples to identify and repurpose compounds that disrupt oncogenic GGIs. Notably, we found that the antibiotics eperezolid and radezolid reduced cancer-acquired GGIs, which we confirmed with a Hi-C experiment. This work showcases AI-directed research in epigenetics, enhances cancer biology predictivity, and can promote wide-ranging drug repurposing in the future.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Predicting which specific parts of a video users will replay is important for several applications, including targeted advertisement placement on video platforms and assisting video creators. In this work, we explore whether it is possible to predict the Most Replayed (MR) data from YouTube videos. To this end, we curate a large video benchmark, the YTMR500 dataset, which comprises 500 YouTube videos with MR data annotations. We evaluate Deep Learning (DL) models of varying complexity on our dataset and perform an extensive ablation study. In addition, we conduct a user study to estimate the human performance on MR data prediction. Our results show that, although by a narrow margin, all the evaluated DL models outperform random predictions. Additionally, they exceed human-level accuracy. This suggests that predicting the MR data is a difficult task that can be enhanced through the assistance of DL. In this repository, we provide our code and dataset. The code includes our trained and tested models, our user studies and results analysis. The YTMR500 dataset is provided through an H5 file.
Publicly available data about potential PFAS sources and PFAS measurements in fish tissue. This dataset is associated with the following publication: DeLuca, N., A. Mullikin, P. Brumm, A. Rappold, and E. Hubal. Using Geospatial Data and Random Forest To Predict PFAS Contamination in Fish Tissue in the Columbia River Basin, United States. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 57: 14024-14035, (2023).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset can be used to predict the stock market. The data were extracted from the MT5 terminal through its Python integration.
The datasets include the minute-by-minute fluctuations of Gold and Silver prices from 1 January 2023 to 17 April 2025. The data can be used to train models for seasonality or for a minute-by-minute approach.
The data has 7 columns:
Two datasets are used:
- Achilles Data Gold-Silver: 1,416,340 rows, to predict Gold, Silver and other Metals.
- Achilles Data Gold: 708,264 rows, to predict Gold, Silver and other Metals.
You may find the paper of our implementation here: https://doi.org/10.48550/arXiv.2410.21291
This dataset contains the predicted prices of the asset Vibe Predict over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Multiple modeling frameworks were used to predict daily temperatures at 0.5 m depth intervals for a set of diverse lakes in the U.S. states of Minnesota and Wisconsin. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean-square error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations. Zip files for each lake contain four files, one for each of PB, PB0, DL, and PG ...
https://spdx.org/licenses/CC0-1.0.html
Fungal and arthropod consumers constitute the vast majority of global terrestrial biodiversity. Yet, the link from richness and composition of producer (plant) communities to the richness of consumer communities is poorly understood. Fungal and arthropod species richness could be a simple function of producer species richness at a site. Alternatively, it could be a complex function of chemical and structural properties of the producer species making up communities. We used databases on plant-fungus and plant-arthropod trophic links to derive the richness of consumer biota per associated plant species (coined link score). We assessed how well link scores could be predicted by simple attributes of plant species. Next, we used a multi-taxon inventory of 130 sites, representing all major habitat types in a country (Denmark), to investigate whether link scores summed over plant species in communities (coined link sum) could outperform simple plant species richness as predictor of fungal and arthropod richness at the sites. We found plant species’ link scores for both fungi and arthropods to be positively related to plant size, regional occupancy, nativeness and ectomycorrhizal status. Link-based indices generally improved the prediction of richness of fungal and arthropod communities. For fungal communities, both observed link sum (from databases) and predicted link sum (from plant attributes) had high predictive power, while plant richness alone had none. For arthropod communities, predictive performance varied between functional groups. For both fungi and arthropods, richness predictions were further improved by considering abiotic habitat conditions. Our results underline the importance of plants as niche space for the megadiverse groups of arthropods and fungi. The plant-attribute approach holds promise for predicting local and regional consumer richness in areas of the world lacking detailed plant-consumer databases. 
Methods: Data on plant-fungus and plant-arthropod interaction links for the 549 plant species found across the 130 BioWide sites in Denmark. Detailed descriptions of field data collection protocols are found in Brunbjerg AK, Bruun HH, Brøndum L, Classen AT, Dalby L, Fog K, et al. (2019) A systematic survey of regional multi-taxon biodiversity: evaluating strategies and coverage. BMC Ecology 19(1):43. doi: 10.1186/s12898-019-0260-x. Raw data on known interaction links between all relevant plant taxa (1349 taxa at the species or genus level) and associated arthropod species were retrieved from the BRC database (https://www.brc.ac.uk/dbif/), and similar data on associated fungal species from the Danish Fungal Database (https://svampe.databasen.org/). The raw data were processed to obtain an observed arthropod link score and an observed fungal link score per plant species. The calculation of link scores from raw data is detailed in the associated manuscript. Attributes of the 549 plant species used to model their predicted link score (ectomycorrhizal status, native area, occupancy in Denmark, phylogenetic grouping, lifespan, life form and size) were compiled from sources detailed in the associated manuscript.
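The relation between link score and link sum can be made concrete with a short sketch. The species and score values below are hypothetical placeholders, not values from the dataset; the point is only that two sites with the same plant richness can have very different link sums.

```python
def link_sum(community, link_scores):
    """Sum the per-species link scores over the plant species present at a site."""
    return sum(link_scores[species] for species in community)

# Hypothetical link scores: number of consumer species associated with each plant.
fungal_link_scores = {
    "Quercus robur": 300,
    "Calluna vulgaris": 40,
    "Poa annua": 10,
    "Bellis perennis": 5,
}

site_a = {"Quercus robur", "Poa annua"}    # plant richness 2
site_b = {"Bellis perennis", "Poa annua"}  # plant richness 2

print(len(site_a), link_sum(site_a, fungal_link_scores))  # 2 310
print(len(site_b), link_sum(site_b, fungal_link_scores))  # 2 15
```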
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate prediction of recurrent clinical events is crucial for effective management of chronic conditions such as cancer and cardiovascular disease. In recent years, longitudinal health informatics databases, which routinely collect data on repeated clinical events, have been increasingly used to construct risk prediction models. We introduce a novel nonparametric framework to predict recurrent events on a gap time scale using survival tree ensembles. Our framework incorporates two predictive modeling strategies: episode-specific model and global model. These models avoid strong assumptions on how future event risk depends on previous event history and other predictors, making them a promising alternative to Cox-type models. Additional complexities in tree-based prediction for recurrent events include induced informative censoring of gap times and inter-event correlations. We develop algorithms to address these issues through the use of inverse probability of censoring weighting and modified resampling procedures. Applied to SEER-Medicare data to predict repeated hospitalizations for breast cancer patients, our models showed superior performance. In particular, borrowing information across events via global models substantially improved prediction accuracy for later hospitalizations. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
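Inverse probability of censoring weighting (IPCW) can be illustrated in its most basic form: each observed event is up-weighted by the inverse of the estimated probability of remaining uncensored up to that time, and censored observations get weight zero. The sketch below is a generic textbook version under simplifying assumptions (a Kaplan-Meier estimate of the censoring distribution, no tied times, no covariates); it is not the authors' algorithm, which additionally handles induced informative censoring of gap times and inter-event correlation.

```python
def censoring_survival_before(t, times, events):
    """Kaplan-Meier estimate of the probability of being uncensored just before t.

    times: observed times; events: 1 if the event was observed, 0 if censored.
    Censoring is treated as the 'event' of interest for this estimate.
    """
    surv = 1.0
    censor_times = sorted({u for u, e in zip(times, events) if e == 0 and u < t})
    for u in censor_times:
        at_risk = sum(1 for v in times if v >= u)
        censored = sum(1 for v, e in zip(times, events) if v == u and e == 0)
        surv *= 1.0 - censored / at_risk
    return surv

def ipcw_weights(times, events):
    """Weight 1 / G(t-) for observed events and 0 for censored observations."""
    weights = []
    for t, e in zip(times, events):
        if e == 1:
            weights.append(1.0 / censoring_survival_before(t, times, events))
        else:
            weights.append(0.0)
    return weights

# Hypothetical gap times (in months); the second observation is censored.
weights = ipcw_weights([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
print([round(w, 2) for w in weights])  # [1.0, 0.0, 1.5, 1.5]
```

The two events after the censoring time absorb the weight of the censored observation, so the weights still sum to the sample size.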
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution. It includes demographic data, socioeconomic factors and academic performance information that can be used to analyze the possible predictors of student dropout and academic success. This dataset contains multiple disjoint databases consisting of relevant information available at the time of enrollment, such as application mode, marital status, course chosen and more. Additionally, this data can be used to estimate overall student performance at the end of each semester by assessing curricular units credited, enrolled, evaluated and approved, as well as their respective grades. Finally, it includes the unemployment rate, inflation rate and GDP of the region, which can help show how economic factors play into student dropout rates or academic success outcomes. The dataset can provide insight into what motivates students to stay in school or abandon their studies across a wide range of disciplines such as agronomy, design, education, nursing, journalism, management, social service, or technologies.
This dataset can be used to understand and predict student dropouts and academic outcomes. The data includes a variety of demographic, social-economic and academic performance factors related to the students enrolled in higher education institutions. The dataset provides valuable insights into the factors that affect student success and could be used to guide interventions and policies related to student retention.
Using this dataset, researchers can investigate two key questions:
- Which specific predictive factors are linked with student dropout or completion?
- How do different features interact with each other?
For example, researchers could explore whether any demographic characteristics (e.g., gender, age at enrollment) or environmental conditions (e.g., unemployment rate in the region) are associated with higher student success rates, and what implications poverty has for educational outcomes. Answering these questions can give administrators critical information for formulating strategies that promote successful degree completion among students from diverse backgrounds in their institutions.
To use this dataset effectively, researchers should familiarize themselves with all of its variables, including categorical (qualitative) variables such as gender or application mode; numerical variables such as the number of curricular units at the beginning of each semester or age at enrollment; ordinal variables such as marital status; variables that track trends over time, such as inflation rate or GDP; and frequency measures such as the percentage of scholarship holders. Researchers should also be aware of potential bias in the data before running any analysis (for example, whether one population is underrepresented compared with another), because such bias can lead to unexpected results if it is not taken into account. Finally, practitioners should note that this Kaggle dataset contains only one semester's worth of information for each admission intake; studies covering a longer period, spanning several year-long admission seasons, might provide more accurate results on retention and achievement.
- Prediction of Student Retention: This dataset can be used to develop predictive models that identify risk factors for student dropout and support early interventions to improve student retention rates.
- Improved Academic Performance: By using this data, higher education institutions could better understand their students' academic progress and identify areas of improvement from both an individual and institutional perspective. This will enable them to develop targeted courses, activities, or initiatives that enhance academic performance more effectively and efficiently.
- Accessibility Assistance: Using the demographic information included in the dataset, institutions could develop s...