License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
Market basket analysis with Apriori algorithm
A retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which records all transactions that occurred over a period of time. The retailer will use the results to grow the business: by suggesting relevant itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to discover associations between different objects in a set, such as frequent patterns in a transaction database. It can reveal which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each is briefly described below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we clean the data frame by removing missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
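Although the write-up is truncated at this point, the conversion-and-mining step it describes can be sketched in a few lines with the arules package. This is a hedged illustration only: the column names (BillNo, Itemname) and the support/confidence thresholds are assumptions, not details confirmed by the original write-up.

```r
# Minimal sketch (assumed column names): group items by invoice and
# mine association rules with the Apriori algorithm from arules.
library(readxl)
library(arules)

retail <- read_excel("Assignment-1_Data.xlsx")
retail <- na.omit(retail)                       # drop missing values

# One transaction per invoice: collapse each invoice's items into a set
items_by_invoice <- split(retail$Itemname, retail$BillNo)
trans <- as(items_by_invoice, "transactions")

# Mine rules above minimum support/confidence, then rank by lift
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))
```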
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
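The core of the REVS procedure is easy to prototype. Below is a hypothetical R sketch of the idea, ranking predictors by how often they appear in the best all-subsets models (via the leaps package) and then fitting the nested model series; it illustrates the approach and is not the authors' implementation.

```r
# Hypothetical sketch of REVS: rank predictors by empirical support from
# all-subsets regression, then fit a nested series of linear models.
library(leaps)

revs_models <- function(formula, data, nvmax = 8) {
  subsets   <- regsubsets(formula, data = data, nvmax = nvmax)
  inclusion <- summary(subsets)$which[, -1, drop = FALSE]  # drop intercept
  support   <- colSums(inclusion)         # times each variable is selected
  ranked    <- names(sort(support, decreasing = TRUE))
  response  <- all.vars(formula)[1]
  # Model k contains the k most-supported variables
  lapply(seq_along(ranked), function(k)
    lm(reformulate(ranked[1:k], response), data = data))
}

# models <- revs_models(y ~ ., data = mydata)
# sapply(models, AIC)   # post-hoc comparison of the small model series
```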
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Clemens, Michael A., and Erwin R. Tiongson (2017). "Split Decisions: Household Finance When a Policy Discontinuity Allocates Overseas Work." Review of Economics and Statistics 99(3), 531-543.
News: now with a 10.0 Kaggle usability score; a supplemental metadata.csv file has been added to the dataset.
Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. The dataset is split into training, validation, and test folders containing 4000 (~84%), 385 (~8%), and 385 (~8%) fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to make it easy to benchmark your glaucoma classification models in Kaggle. Please make a contribution in the code tab; I have created a template to make it even easier!
Please cite the dataset and at least the first of my related works if you found this dataset useful!
Improvements from v1:
- Following an ablation study of the image standardization methods applied to dataset v1 [3], images are standardized according to the CROP methodology (remove the black background before resizing). This method retains more of the actual fundus foreground in the resulting image.
- Increased the image resize dimensions from 256x256 pixels to 512x512 pixels. Reason: provides greater model input flexibility, detail, and size; this also better supports the ONH-cropping models.
- Added 3000 images from the Rotterdam EyePACS AIROGS dev set. Reason: more data samples can improve model generalizability.
- Readjusted the train/val/test split. Reason: the validation and test split sizes were different.
- Improved sampling from the source dataset. Reason: v1 NRG samples were not randomly selected.
Drawbacks of Rotterdam EyePACS AIROGS: one of the largest drawbacks of the original dataset is its accessibility. It requires a long download and a large amount of storage, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). It also contains raw fundus images in their original dimensions; these often include a large amount of black background, and the dimensions are too large for machine learning inputs. The proposed dataset addresses these concerns through image sampling and image standardization, which balance and reduce the dataset size respectively.
Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].
About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.
Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations taken below the water surface.
Potential methods for addressing these limitations:
- Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage. Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10), and filter to waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles: these waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.
File descriptions:
- "year_byscene=XXXX.zip" includes temperature summary statistics for individual waterbodies and for the deepest point (the furthest point from land within a waterbody) of each waterbody, by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data were extracted from the raster images because clouds obscured the image, the tile covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset is year_byscene=2023/tile_hv=002-001/part-0.parquet.
- "year=XXXX.zip" includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX datasets are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset is year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet.
- "example_script_for_using_parquet.R" includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, compile the .parquet files in nested directories using the R arrow package, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" is a crosswalk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" is a crosswalk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. It also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" is a map of the Landsat grid tiles labelled by horizontal-vertical ID.
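For orientation, here is a hedged sketch of the kind of query the documented layout supports, assuming a year_byscene=2023 zip has been extracted into the working directory and a recent arrow version; the filter thresholds are illustrative, and the column names are those documented above.

```r
# Sketch: open the hive-partitioned .parquet directories with arrow and
# apply the cloud/water-pixel filters described above.
library(arrow)
library(dplyr)

ds <- open_dataset("year_byscene=2023")   # partitioned by tile_hv

ds %>%
  filter(tile_hv == "002-001") %>%
  mutate(percent_cloud_pixels =
           wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
  filter(percent_cloud_pixels < 0.1,      # illustrative cloud threshold
         wb_dswe1_pixels >= 10,           # enough water pixels
         dp_dswe == 1) %>%                # deepest point is water
  collect()
```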
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.
Due to data size restrictions, the dataset has been divided into six subdatasets, each published as a separate entry on Zenodo. This repository contains subdataset 2.
Structure
In Vivo Data
Number of Acquisitions: 20,000
Volunteers: Nine volunteers
File Structure: Each volunteer's data is compressed in a separate zip file.
Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
Regions:
Abdomen: 6599 acquisitions
Neck: 3294 acquisitions
Breast: 3291 acquisitions
Lower limbs: 2616 acquisitions
Upper limbs: 2110 acquisitions
Back: 2090 acquisitions
File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.
In Vitro Data
Number of Acquisitions: 32 from CIRS model 054G phantom
File Structure: The in vitro data is compressed in the cirs-phantom.zip file.
File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.
CSV Files
Two CSV files are provided:
invivo_dataset.csv:
Contains a list of all in vivo acquisitions.
Columns: id, path, volunteer id, body region.
invitro_dataset.csv:
Contains a list of all in vitro acquisitions.
Columns: id, path
Zenodo dataset splits and files
The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 2nd split.
File name Size Zenodo subdataset number
invivo_dataset.csv 995.9 kB 1
invitro_dataset.csv 1.1 kB 1
cirs-phantom.zip 418.2 MB 1
volunteer-1-lowerLimbs.zip 29.7 GB 1
volunteer-1-carotids.zip 8.8 GB 1
volunteer-1-back.zip 7.1 GB 1
volunteer-1-abdomen.zip 34.0 GB 2
volunteer-1-breast.zip 15.7 GB 2
volunteer-1-upperLimbs.zip 25.0 GB 3
volunteer-2.zip 26.5 GB 4
volunteer-3.zip 20.3 GB 3
volunteer-4.zip 24.1 GB 5
volunteer-5.zip 6.5 GB 5
volunteer-6.zip 11.5 GB 5
volunteer-7.zip 11.1 GB 6
volunteer-8.zip 21.2 GB 6
volunteer-9.zip 23.2 GB 4
Normalized RF Images
Beamforming:
Depth from 1 mm to 55 mm
Width spanning the probe aperture
Grid: 𝜆/8 × 𝜆/8
Resulting images shape: 1483 × 1189
Two beamformed RF images from each acquisition:
Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)
Target image: coherently compounded image from 87 PW acquisitions steered at different angles (acquisition-xxxx-87PWs)
Normalization:
The two RF images have been normalized
To display the images:
Perform the envelope detection (to obtain the IQ images)
Log-compress (to obtain the B-mode images)
File Format: Saved in npy format, loadable using Python and numpy.load(file).
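As a rough illustration of the display recipe above (envelope detection, then log-compression), here is a hedged R sketch; the file name is hypothetical, the .npy file is read via reticulate, and envelope detection is implemented directly with an FFT-based analytic signal rather than any project-specific code.

```r
# Sketch: load one normalized RF image and render a B-mode view.
library(reticulate)
np <- import("numpy")
rf <- np$load("acquisition_00000-1PW.npy")     # hypothetical file name

analytic <- function(x) {                      # analytic signal via FFT
  n <- length(x); h <- numeric(n); h[1] <- 1
  if (n %% 2 == 0) { h[n / 2 + 1] <- 1; h[2:(n / 2)] <- 2 }
  else h[2:((n + 1) / 2)] <- 2
  fft(fft(x) * h, inverse = TRUE) / n
}

env   <- apply(rf, 2, function(col) Mod(analytic(col)))  # envelope (IQ magnitude)
bmode <- 20 * log10(env / max(env))                      # log-compression (dB)
image(t(bmode[nrow(bmode):1, ]), col = gray.colors(256),
      zlim = c(-60, 0), axes = FALSE)                    # 60 dB dynamic range
```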
Training and Validation Split in the paper
For the volunteer-based split used in the paper:
Training set: volunteers 1, 2, 3, 6, 7, 9
Validation set: volunteer 4
Test set: volunteers 5, 8
Images analyzed in the paper
Carotid acquisition (from volunteer 5): acquisition_12397
Back acquisition (from volunteer 8): acquisition_19764
In vitro acquisition: invitro_00030
License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Please cite the original paper when using this dataset:
Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256
Contact
For inquiries or issues related to this dataset, please contact:
Name: Roser Viñals
Email: roser.vinalsterres@epfl.ch
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree, and Random Forest for this purpose.
Steps involved:
1) Read the csv file.
2) Data cleaning:
- The variables Country and Status had character data types and had to be converted to factors.
- 2,563 missing values were encountered, with the Population variable having the most missing values (652).
- Rows with missing values were dropped before running the analysis.
3) Run linear regression:
- Before running the linear regression, 3 variables were dropped as they did not have much of an effect on the dependent variable (Life Expectancy): Country, Year, and Status. This left 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables can explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION: we check for outliers using the IQR and find 54. These outliers are removed before running the regression again; multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY: we check for multicollinearity using the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with absolute VIF values above 5 should be removed. Six variables have a VIF higher than 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths are strongly collinear, so we drop Infant.deaths (which has the higher VIF).
- When we run the linear regression model again, the VIF of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. The variable thinness1.19 is dropped next and the regression is run once more.
- The variable thinness5.9, whose absolute VIF was 7.61, drops to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
- SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST SETS: on the training data we get a multiple R-squared of 86% and a p-value below alpha, which indicates statistical significance. We use the trained model to predict the test data and compute RMSE and MAPE, loading library(Metrics) for this purpose.
- In linear regression, the RMSE (Root Mean Squared Error) is 3.2, indicating that on average the predicted values are off by 3.2 years from the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037, indicating a prediction accuracy of about 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55, indicating that on average the predicted values deviate by approximately 2.55 years from the actual values.
Conclusion: Random Forest is the best model for predicting life expectancy values, as it has the lowest RMSE, MAPE, and MAE.
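For reproducibility, here is a hedged R sketch of the linear-regression portion of the workflow described above; the file name and column names follow the common Kaggle life-expectancy CSV and are assumptions.

```r
# Hypothetical sketch: clean, check VIF, split, fit, and score.
library(car)      # vif()
library(Metrics)  # rmse(), mape(), mae()

df <- na.omit(read.csv("Life_Expectancy_Data.csv"))   # assumed file name
df$Country <- df$Year <- df$Status <- NULL            # drop the 3 variables

fit <- lm(Life.expectancy ~ ., data = df)
summary(fit)$r.squared                  # multiple R-squared
sort(vif(fit), decreasing = TRUE)       # drop VIF > 5 one at a time, refit

set.seed(42)
idx   <- sample(nrow(df), 0.7 * nrow(df))             # illustrative 70/30 split
train <- df[idx, ]
test  <- df[-idx, ]
m     <- lm(Life.expectancy ~ ., data = train)
pred  <- predict(m, newdata = test)
c(RMSE = rmse(test$Life.expectancy, pred),
  MAPE = mape(test$Life.expectancy, pred),
  MAE  = mae(test$Life.expectancy, pred))
```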
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. The NetVote dataset contains the outputs of the NetVote program when applied to voting data from VoteWatch (http://www.votewatch.eu/).
These results were used in the conference paper referenced below:
Source code. The NetVote source code is available on GitHub: https://github.com/CompNet/NetVotes.
Citation. If you use our dataset or tool, please cite the article below.
@InProceedings{Mendonca2015,
  author    = {Mendonça, Israel and Figueiredo, Rosa and Labatut, Vincent and Michelon, Philippe},
  title     = {Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the {E}uropean {P}arliament},
  booktitle = {2\textsuperscript{nd} European Network Intelligence Conference ({ENIC})},
  year      = {2015},
  pages     = {122-129},
  address   = {Karlskrona, SE},
  publisher = {IEEE Publishing},
  doi       = {10.1109/ENIC.2015.25},
}
-------------------------
Details. This archive contains the following folders:
-------------------------
License. These data are shared under a Creative Commons 0 license.
Contact. Vincent Labatut <vincent.labatut@univ-avignon.fr> & Rosa Figueiredo <rosa.figueiredo@univ-avignon.fr>
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
In molecular phylogenetics, partition models and mixture models provide different approaches to accommodating heterogeneity in genomic sequencing data. Both types of models generally give a superior fit to data than models that assume the process of sequence evolution is homogeneous across sites and lineages. The Akaike Information Criterion (AIC), an estimator of Kullback-Leibler divergence, and the Bayesian Information Criterion (BIC) are popular tools to select models in phylogenetics. Recent work suggests AIC should not be used for comparing mixture and partition models. In this work, we clarify that this difficulty is not fully explained by AIC misestimating the Kullback-Leibler divergence. We also investigate the performance of the AIC and BIC by comparing amongst mixture models and amongst partition models. We find that under non-standard conditions (i.e. when some edges have a small expected number of changes), AIC underestimates the expected Kullback-Leibler divergence. Under such conditions, AIC preferred the complex mixture models and BIC preferred the simpler mixture models. The mixture models selected by AIC had a better performance in estimating the edge lengths, while the simpler models selected by BIC performed better in estimating the base frequencies and substitution rate parameters. In contrast, AIC and BIC both prefer simpler partition models over more complex partition models under non-standard conditions, despite the fact that the more complex partition model was the generating model. We also investigated how mispartitioning (i.e. grouping sites that have not evolved under the same process) affects both the performance of partition models compared to mixture models and the model selection process. We found that as the level of mispartitioning increases, the bias of AIC in estimating the expected Kullback-Leibler divergence remains the same, and the branch lengths and evolutionary parameters estimated by partition models become less accurate. We recommend that researchers be cautious when using AIC and BIC to select among partition and mixture models; other alternatives, such as cross-validation and bootstrapping, should be explored, but may suffer similar limitations.
Methods
This document records the pipeline used in the data analyses in "Performance of Akaike Information Criterion and Bayesian Information Criterion in selecting partition models and mixture models". The main processes included generating alignments, fitting four different partition and mixture models, and analysing the results. The data were generated with Seq-Gen-1.3.4 (Rambaut and Grassly 1997). The model fitting was performed in IQ-TREE2 (Minh et al. 2020) on a Linux system. The results were analysed using the R package phangorn in R (version 3.6.2) (Schliep 2011, R Core Team 2019). We wrote custom bash scripts to extract relevant parts of the results from IQ-TREE2, and these results were processed in R. The zip files contain four folders: "bash-scripts", "data", "R-codes", and "results-IQTREE2". The "bash-scripts" folder contains all the bash scripts for simulating alignments and performing model fitting. The "data" folder contains two child folders: "sequence-data" and "Rdata". The child folder "sequence-data" contains the alignments created for the simulations. The other child folder, "Rdata", contains the files created by R to store the results extracted from IQ-TREE2 and the results calculated in R. The "R-codes" folder includes the R code for analysing the results from IQ-TREE2.
The folder "results-IQTREE2" stores all the results from the fitted models. The three simulations we performed were essentially the same: we used the same parameters for the evolutionary models, and trees with the same topologies but different edge lengths to generate the sequences. The steps were: simulating alignments, model fitting and extracting results, and processing the extracted results. The first two steps were performed on a Linux system using bash scripts, and the last step was performed in R.
Simulating Alignments
To simulate heterogeneous data, we created two multiple sequence alignments (MSAs) under simple homogeneous models, each model comprising a substitution model and an edge-weighted phylogenetic tree (the tree topology was fixed). Each MSA contained eight taxa and 1000 sites. This was performed using the bash script "step1_seqgen_data.sh" in Linux. These two MSAs were then concatenated, giving an MSA with 2000 sites. This was equivalent to generating the concatenated MSA under a two-block unlinked edge lengths partition model (P-UEL). This was performed using the bash script "step2_concat_data.sh", and created the 0% group of MSAs. To simulate a situation where the initial choice of blocks does not properly account for the heterogeneity in the concatenated MSA (i.e., mispartitioning), we randomly selected a proportion of 0%, 5%, 10%, 15%, ..., up to 50% of sites from each block and swapped them. That is, the sites drawn from the first block were placed in the second block, and the sites drawn from the second block were placed in the first block. This process was repeated 100 times for each proportion of mispartitioned sites, giving a total of 1100 MSAs. This process involved two steps. The first step was to generate ten sets of different amounts of numbers without duplicates from each of the two intervals [1,1000] and [1001,2000]. The amounts of numbers were based on the proportions of incorrectly partitioned sites; for example, the first set has 50 numbers on each interval, the second set has 100 numbers on each interval, etc. This first step was performed in R; the R code is not provided, but the random-number text files are included. The second step was to select sites from the concatenated MSAs at the locations given by the numbers created in the first step, which created the 5%, 10%, 15%, ..., 50% groups of MSAs. The second step used the following bash scripts: "step3_1_mixmatch_pre_data.sh" and "step3_2_mixmatch_data.sh". The MSAs used in the simulations were created and stored in the "data" folder.
Model Fitting and Extracting Results
The next steps were to fit four different partition and mixture models to the data in IQ-TREE2 and extract the results. The models used were the P-LEL partition model, the P-UEL partition model, the M-UGP mixture model, and the M-LGP mixture model. For the partition models, the partitioning schemes were the same: the first 1000 sites as one block and the second 1000 sites as another. For the groups of MSAs with different proportions of mispartitioned sites, this was equivalent to fitting the partition models with an incorrect partitioning scheme. The partitioning scheme was called "parscheme.nex". The bash scripts for model fitting are stored in the "bash-scripts" folder; to run them, users can follow the order indicated in the script names.
The inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors, AIC values, and BIC values were extracted from the IQ-TREE2 results. These extracted results were stored in the "results-IQTREE2" folder and used to evaluate the performance of AIC, BIC, and the models in R.
Processing Extracted Results in R
To evaluate the performance of AIC and BIC, and the performance of the fitted partition and mixture models, we calculated the following measures: the rEKL values, the bias of AIC in estimating the rEKL, BIC values, and the branch scores (bs). We also compared the distribution of the estimated model parameters (i.e. base frequencies and rate matrices) to the generating model parameters. These processes were performed in R. The first step was to read in the inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors, AIC values, and BIC values extracted from the IQ-TREE2 results. These R scripts are stored in the "R-codes" folder, and their names start with "readpara_..." (e.g. "readpara_MLGP_standard"). After reading in all the parameters for each model, we estimated the measures mentioned above using the corresponding R scripts, also in the "R-codes" folder. The functions used in these R scripts are stored in "R_functions_simulation". Note that the directories need to be changed if users want to run these R scripts on their own computers.
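As one concrete example of the post-processing described, the branch-score comparison between an inferred tree and the generating tree can be done with phangorn. This is a hedged sketch with assumed file names, not the authors' scripts.

```r
# Sketch: compare an IQ-TREE2 inferred tree against the generating tree.
library(ape)       # read.tree()
library(phangorn)  # treedist()

true_tree <- read.tree("generating_tree.nwk")   # assumed file name
est_tree  <- read.tree("alignment.treefile")    # typical IQ-TREE2 output
treedist(true_tree, est_tree)["branch.score.difference"]
```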
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.
Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include:
- Tumor Classification: identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images.
- Tumor Segmentation: applying pixel-level classification and segmentation techniques for tumor boundary detection.
- Educational and Rapid Prototyping: ideal for educational purposes or quick experimentation without requiring large image processing capabilities.
This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).
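A hedged sketch of loading one row and viewing it as an image, assuming the rows are square images and the CSV has no header (both the file name and these layout details are assumptions):

```r
# Sketch: read the pixel CSV and display the first image in grayscale.
px   <- read.csv("brain_tumor_pixels.csv", header = FALSE)  # assumed name
row1 <- as.numeric(px[1, ])
side <- sqrt(length(row1))
stopifnot(side == round(side))           # only works for square images
img  <- matrix(row1, nrow = side, byrow = TRUE)
image(t(img[side:1, ]), col = gray((0:255) / 255), axes = FALSE)
```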
This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.
Due to data size restrictions, the dataset has been divided into six subdatasets, each published as a separate entry on Zenodo. This repository contains subdataset 5.
Structure
In Vivo Data
Number of Acquisitions: 20,000
Volunteers: Nine volunteers
File Structure: Each volunteer's data is compressed in a separate zip file.
Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
Regions:
Abdomen: 6599 acquisitions
Neck: 3294 acquisitions
Breast: 3291 acquisitions
Lower limbs: 2616 acquisitions
Upper limbs: 2110 acquisitions
Back: 2090 acquisitions
File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.
In Vitro Data
Number of Acquisitions: 32 from CIRS model 054G phantom
File Structure: The in vitro data is compressed in the cirs-phantom.zip file.
File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.
CSV Files
Two CSV files are provided:
invivo_dataset.csv:
Contains a list of all in vivo acquisitions.
Columns: id, path, volunteer id, body region.
invitro_dataset.csv:
Contains a list of all in vitro acquisitions.
Columns: id, path
Zenodo dataset splits and files
The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 5th split.
File name Size Zenodo subdataset number
invivo_dataset.csv 995.9 kB 1
invitro_dataset.csv 1.1 kB 1
cirs-phantom.zip 418.2 MB 1
volunteer-1-lowerLimbs.zip 29.7 GB 1
volunteer-1-carotids.zip 8.8 GB 1
volunteer-1-back.zip 7.1 GB 1
volunteer-1-abdomen.zip 34.0 GB 2
volunteer-1-breast.zip 15.7 GB 2
volunteer-1-upperLimbs.zip 25.0 GB 3
volunteer-2.zip 26.5 GB 4
volunteer-3.zip 20.3 GB 3
volunteer-4.zip 24.1 GB 5
volunteer-5.zip 6.5 GB 5
volunteer-6.zip 11.5 GB 5
volunteer-7.zip 11.1 GB 6
volunteer-8.zip 21.2 GB 6
volunteer-9.zip 23.2 GB 4
Normalized RF Images
Beamforming:
Depth from 1 mm to 55 mm
Width spanning the probe aperture
Grid: 𝜆/8 × 𝜆/8
Resulting images shape: 1483 × 1189
Two beamformed RF images from each acquisition:
Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)
Target image: coherently compounded image from 87 PW acquisitions steered at different angles (acquisition-xxxx-87PWs)
Normalization:
The two RF images have been normalized
To display the images:
Perform the envelope detection (to obtain the IQ images)
Log-compress (to obtain the B-mode images)
File Format: Saved in npy format, loadable using Python and numpy.load(file).
Training and Validation Split in the paper
For the volunteer-based split used in the paper:
Training set: volunteers 1, 2, 3, 6, 7, 9
Validation set: volunteer 4
Test set: volunteers 5, 8
Images analyzed in the paper
Carotid acquisition (from volunteer 5): acquisition_12397
Back acquisition (from volunteer 8): acquisition_19764
In vitro acquisition: invitro_00030
License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Please cite the original paper when using this dataset:
Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256
Contact
For inquiries or issues related to this dataset, please contact:
Name: Roser Viñals
Email: roser.vinalsterres@epfl.ch
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. https://doi.org/10.3390/jimaging9120256. Please cite the original paper when using this dataset.
Due to data size restrictions, the dataset has been divided into six subdatasets, each published as a separate entry on Zenodo. This repository contains subdataset 3.
Structure
In Vivo Data
Number of Acquisitions: 20,000
Volunteers: Nine volunteers
File Structure: Each volunteer's data is compressed in a separate zip file.
Note: For volunteer 1, due to a higher number of acquisitions, data for this volunteer is distributed across multiple zip files, each containing acquisitions from different body regions.
Regions:
Abdomen: 6599 acquisitions
Neck: 3294 acquisitions
Breast: 3291 acquisitions
Lower limbs: 2616 acquisitions
Upper limbs: 2110 acquisitions
Back: 2090 acquisitions
File Naming Convention: Incremental IDs from acquisition_00000 to acquisition_19999.
In Vitro Data
Number of Acquisitions: 32 from CIRS model 054G phantom
File Structure: The in vitro data is compressed in the cirs-phantom.zip file.
File Naming Convention: Incremental IDs from invitro_00000 to invitro_00031.
CSV Files
Two CSV files are provided:
invivo_dataset.csv:
Contains a list of all in vivo acquisitions.
Columns: id, path, volunteer id, body region.
invitro_dataset.csv:
Contains a list of all in vitro acquisitions.
Columns: id, path
Zenodo dataset splits and files
The dataset has been divided into six subdatasets, each one published in a separate entry on Zenodo. The following table indicates, for each file or compressed folder, the Zenodo dataset split where it has been uploaded along with its size. Each dataset split is named "A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning: Dataset (ii/6)", where ii represents the split number. This repository contains the 3rd split.
File name Size Zenodo subdataset number
invivo_dataset.csv 995.9 kB 1
invitro_dataset.csv 1.1 kB 1
cirs-phantom.zip 418.2 MB 1
volunteer-1-lowerLimbs.zip 29.7 GB 1
volunteer-1-carotids.zip 8.8 GB 1
volunteer-1-back.zip 7.1 GB 1
volunteer-1-abdomen.zip 34.0 GB 2
volunteer-1-breast.zip 15.7 GB 2
volunteer-1-upperLimbs.zip 25.0 GB 3
volunteer-2.zip 26.5 GB 4
volunteer-3.zip 20.3 GB 3
volunteer-4.zip 24.1 GB 5
volunteer-5.zip 6.5 GB 5
volunteer-6.zip 11.5 GB 5
volunteer-7.zip 11.1 GB 6
volunteer-8.zip 21.2 GB 6
volunteer-9.zip 23.2 GB 4
Normalized RF Images
Beamforming:
Depth from 1 mm to 55 mm
Width spanning the probe aperture
Grid: 𝜆/8 × 𝜆/8
Resulting images shape: 1483 × 1189
Two beamformed RF images from each acquisition:
Input image: single unfocused acquisition obtained from a single plane wave (PW) steered at 0° (acquisition-xxxx-1PW)
Target image: coherently compounded image from 87 PW acquisitions steered at different angles (acquisition-xxxx-87PWs)
Normalization:
The two RF images have been normalized
To display the images:
Perform the envelope detection (to obtain the IQ images)
Log-compress (to obtain the B-mode images)
File Format: Saved in npy format, loadable using Python and numpy.load(file).
Training and Validation Split in the paper
For the volunteer-based split used in the paper:
Training set: volunteers 1, 2, 3, 6, 7, 9
Validation set: volunteer 4
Test set: volunteers 5, 8
Images analyzed in the paper
Carotid acquisition (from volunteer 5): acquisition_12397
Back acquisition (from volunteer 8): acquisition_19764
In vitro acquisition: invitro_00030
License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Please cite the original paper when using this dataset:
Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In Vivo Ultrafast Ultrasound Image Enhancement with Deep Learning. J. Imaging 2023, 9, 256. DOI: 10.3390/jimaging9120256
Contact
For inquiries or issues related to this dataset, please contact:
Name: Roser Viñals
Email: roser.vinalsterres@epfl.ch
Validating a novel housing method for inbred mice: mixed-strain housing. The aims were to see whether this housing method affected strain-typical mouse phenotypes, whether variance in the data was affected, and how statistical power was increased through this split-plot design.
Terms of use: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
This dataset provides a comprehensive overview of various socio-economic and health metrics related to gender across different countries. The metrics range from life expectancy, schooling, and gross national income per capita to maternal mortality rates, adolescent birth rates, and labor force participation. Such data is vital for researchers, policymakers, and advocates working towards gender equality and understanding the intricate nuances of gender disparities in different regions.
Notably, this dataset has been featured as an example dataset in the R programming language package named genderstat.
Link to CRAN package: https://cran.r-project.org/web/packages/genderstat/index.html
Data for this collection was meticulously extracted from reputable sources to ensure its accuracy and reliability.
Sources:
- UNDP Human Development Reports Data Center
- World Bank Gender Data Portal
Dive into the dataset to explore the varying dimensions of gender disparities and gain insights that can guide interventions and policy decisions.
License: Elsevier CPC user license, https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/
Abstract. The main part of the code presented in this work implements the split-operator method [J.A. Fleck, J.R. Morris, M.D. Feit, Appl. Phys. 10 (1976) 129-160; R. Heather, Comput. Phys. Comm. 63 (1991) 446] for calculating the time evolution of Dirac wave functions. It allows one to study the dynamics of electronic Dirac wave packets under the influence of any number of laser pulses and their interaction with any number of charged ion potentials. The initial wave function can be eith...
Title of program: Dirac++ or (abbreviated) d++
Catalogue Id: AEAS_v1_0
Nature of problem The relativistic time evolution of wave functions according to the Dirac equation is a challenging numerical task. Especially for an electron in the presence of high intensity laser beams and/or highly charged ions, this type of problem is of considerable interest to atomic physicists.
Versions of this program held in the CPC repository in Mendeley Data AEAS_v1_0; Dirac++ or (abbreviated) d++; 10.1016/j.cpc.2008.01.042
This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Materials that efficiently promote the thermodynamically uphill water-splitting reaction under solar illumination are essential for generating carbon-free ("green") hydrogen. Mapping out the combinatorial space of potential photocatalysts for this reaction can be expedited using data-intensive materials exploration. The calculated band gaps and band alignments can serve as key indicators and metrics to computationally screen photoactive materials. Ternary main-group metal sulfides containing p- and s-block elements represent a promising, albeit underexplored, class of photocatalysts. Here, we computationally screen 86 candidate ternary main-group metal sulfides containing p- and s-block elements. By validating electronic structure predictions against experimental band gaps and band edges for synthetically accessible materials, we propose eight potential photocatalysts. Using computed Pourbaix diagrams, we further narrowed the candidate pool to four materials based on the predicted aqueous stability. We then synthesized and characterized these four materials and experimentally screened them for photoresponsiveness under photocatalytically relevant conditions. We also characterized their experimental band gaps and band edge positions and compared them with computational predictions. Based on the experimental screening protocols, we identify MgIn₂S₄ and BaSn₂S₅ as photoresponsive materials with sufficient aqueous stability to be considered in greater depth as potential photocatalysts for overall water-splitting. This record contains the computational predictions for the four candidates discussed in our manuscript.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Enriched electronic health records (EHRs) contain crucial information related to disease progression, and this information can help with decision-making in the health care field. Data analytics in health care is deemed as one of the essential processes that help accelerate the progress of clinical research. However, processing and analyzing EHR data are common bottlenecks in health care data analytics.
The dxpr R package provides mechanisms for integration, wrangling, and visualization of clinical data, including diagnosis and procedure records. First, the dxpr package helps users transform International Classification of Diseases (ICD) codes to a uniform format. After code format transformation, the dxpr package supports four strategies for grouping clinical diagnostic data. For clinical procedure data, two grouping methods can be chosen. After EHRs are integrated, users can employ a set of flexible built-in querying functions for dividing data into case and control groups by using specified criteria and splitting the data into before and after an event based on the record date. Subsequently, the structure of integrated long data can be converted into wide, analysis-ready data that are suitable for statistical analysis and visualization.
We conducted comorbidity data processes based on a cohort of newborns from Medical Information Mart for Intensive Care-III (n = 7,833) by using the dxpr package. We first defined patent ductus arteriosus (PDA) cases as patients who had at least one PDA diagnosis (ICD, Ninth Revision, Clinical Modification [ICD-9-CM] 7470*). Controls were defined as patients who never had a PDA diagnosis. In total, 381 and 7,452 patients with and without PDA, respectively, were included in our study population. Then, we grouped the diagnoses into defined comorbidities. Finally, we observed a statistically significant difference in 8 of the 16 comorbidities among patients with and without PDA, including fluid and electrolyte disorders, valvular disease, and others.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Reddit [source]
Explore the side-splitting world of dad jokes on Reddit! This dataset delves into the humorous dad jokes that abound in the popular subreddit r/dadjokes. Analyze this data to gain insight into which jokes are most popular and why, as well as to discover how audience laughter is measured on Reddit. With columns including 'title', 'score', 'url', 'comms_num', 'created', 'body' and a timestamp, you can use this data to understand what makes a joke truly successful and how Reddit users rate them. Join in on the fun--who knows, you might even learn some quality dad puns yourself!
For more datasets, click here.
- Running sentiment analysis on dad jokes to uncover themes and patterns in humorous text.
- Performing a cluster analysis of similar dad jokes to uncover hidden relationships between them.
- Analyzing the popularity of and interaction with different types of dad jokes: by looking at the score, comments, and URL clicks of each joke, we could assess which are the most liked or most visited among Redditors.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dadjokes.csv

| Column name | Description |
|:------------|:---------------------------------------------------------------|
| title | The title of the dad joke. (String) |
| score | The number of upvotes the joke has received. (Integer) |
| url | The URL of the dad joke. (String) |
| comms_num | The total number of comments the joke has received. (Integer) |
| created | The date and time the joke was posted. (DateTime) |
| body | The actual content of the dad joke. (String) |
| timestamp | The timestamp of when the joke was posted. (Integer) |
If you use this dataset in your research, please credit the original authors and Reddit.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
### Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
### Load csv -- creates a data.frame, where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.
### To look at samples of the data, uncomment this line:
# head(d.train)
### Let's save the Image column as another variable, and remove it from d.train:
### d.train is our dataframe, and we want the column called Image.
### Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL  # removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL   # removes 'Image' from the dataframe
#################################
# The image is represented as a series of numbers, stored as a string.
# Convert these strings to integers by splitting them and converting the result.

# strsplit splits the string
# unlist simplifies its output to a vector of strings
# as.integer converts it to a vector of integers
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and activate the appropriate libraries.
### The tutorial is meant for Linux and OS X, where a different backend library is used, so:
### replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")

### Loop over all images
im.train <- foreach(im = im.train, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine = rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner command for each element of im.train and
# combines the results with rbind (combine by rows).
# Note: %do% evaluates sequentially; %dopar% (with a registered parallel backend)
# would run the evaluations in parallel.
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
### Save all four variables in a data.Rd file; they can be reloaded at any time with load('data.Rd').
save(d.train, d.test, im.train, im.test, file = 'data.Rd')
# Each image is a vector of 96*96 pixels (96*96 = 9216).
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data = rev(im.train[1,]), nrow = 96, ncol = 96)

# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).

# To visualize the image we use R's image function:
image(1:96, 1:96, im, col = gray((0:255)/255))

# Let's color the coordinates for the eyes and nose
points(96 - d.train$nose_tip_x[1], 96 - d.train$nose_tip_y[1], col = "red")
points(96 - d.train$left_eye_center_x[1], 96 - d.train$left_eye_center_y[1], col = "blue")
points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")

# Another good check is to see how variable our data is.
# For example, where are the centers of each nose in the 7049 images? (this takes a while to run):
for (i in 1:nrow(d.train)) {
  points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
}

# There are quite a few outliers -- they could be labeling errors. Looking at one extreme
# example: in this case there's no labeling error, but it shows that not all faces are centered.
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data = rev(im.train[idx,]), nrow = 96, ncol = 96)
image(1:96, 1:96, im, col = gray((0:255)/255))
points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")

# One of the simplest baselines: compute the mean of the coordinates of each keypoint
# in the training set and use that as the prediction for all images.
colMeans(d.train, na.rm = T)

# To build a submission file we apply these computed coordinates to the test instances:
p <- matrix(data = colMeans(d.train, na.rm = T), nrow = nrow(d.test), ncol = ncol(d.train), byrow = T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
# The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:
library(...
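The library call above is truncated in the source. A plausible completion using reshape2's standard melt interface would look like the following sketch (the output file name is an assumption):

```r
# Sketch: reshape predictions to one keypoint per row and write a
# submission file.
library(reshape2)
submission <- melt(predictions, id.vars = "ImageId",
                   variable.name = "FeatureName", value.name = "Location")
write.csv(submission, file = "submission_means.csv", row.names = FALSE)
```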