License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This is a sample of the training data used in the Numerai machine learning competition. https://numer.ai/about
The data is cleaned, regularized and encrypted global equity data. The first 21 columns (feature1 - feature21) are features, and target is the binary class you’re trying to predict.
We want to see what the Kaggle community will produce with this dataset using Kernels.
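For a quick start, here is a minimal sketch of loading the sample and fitting a baseline classifier. The filename is an assumption; point it at the CSV shipped with this dataset.

```python
# Minimal sketch: load the sample and fit a baseline classifier.
# NOTE: the filename is an assumption -- use the CSV shipped with this dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

df = pd.read_csv("numerai_training_data.csv")

feature_cols = [f"feature{i}" for i in range(1, 22)]  # feature1 .. feature21
X, y = df[feature_cols], df["target"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation log loss:", log_loss(y_val, model.predict_proba(X_val)[:, 1]))
```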
License: CDLA-Permissive-1.0 (https://cdla.io/permissive-1-0/)
20 years of Yahoo Finance Open, High, Low, Close, Adjusted Close, and Volume data, plus generated technical features (RSI, SMA), on close to 5000 global equities. Various targets are included, such as 20-day raw returns and residual returns. Use it to build predictive models for the Numerai Signals tournament, where you stake $NMR to earn (or burn) it.
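As an illustration of how technical features like these are typically derived, here is a minimal sketch computing an SMA and a Wilder-style RSI from a close-price series. The column name and window lengths are assumptions, not a description of how this dataset's columns were generated.

```python
# Sketch: simple moving average and Wilder-style RSI from close prices.
# Assumes a DataFrame with a "close" column; window lengths are illustrative.
import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / window, min_periods=window).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / window, min_periods=window).mean()
    return 100 - 100 / (1 + gain / loss)

prices = pd.DataFrame({"close": [100, 101, 99, 102, 104, 103, 105, 107, 106, 108,
                                 110, 109, 111, 112, 114, 113, 115, 116, 118, 117]})
prices["sma"] = sma(prices["close"], window=5)  # short windows for the toy series
prices["rsi"] = rsi(prices["close"], window=5)
print(prices.tail())
```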
License: CDLA-Permissive-1.0 (https://cdla.io/permissive-1-0/)
Daily updated data for Numerai Signals. The data runs from 2003 to today and includes over 5000 stocks from 26 markets, as well as 57 basic factors (e.g. growth, value, momentum).
This dataset now downloads all available files of the latest version daily (v2.1 as of 3 August 2025), which now includes the neutralisation matrix, the latest targets, and more.
See the Code tab for a starter notebook.
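If you prefer to skip the starter notebook, a minimal sketch of loading and joining the files with pandas is below. The filenames and join keys are assumptions; check the dataset's file listing for the actual names.

```python
# Sketch: load the daily Signals files and join factors to targets.
# NOTE: filenames and column names are assumptions -- check the file listing.
import pandas as pd

factors = pd.read_parquet("signals_factors.parquet")  # hypothetical factor file
targets = pd.read_parquet("signals_targets.parquet")  # hypothetical targets file

train = factors.merge(targets, on=["ticker", "date"], how="inner")
print(train.shape)
```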
This stock price OHLCV data is updated daily for use in the weekly submission to Numerai Signals. Note that there are some very strange values, especially in the (adjusted) close and volume data, which are a known issue with the Yahoo! Finance API. When you use this data, make sure that you deal with these unrealistic values.
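One simple way to deal with them is to flag implausible rows before training; a minimal sketch follows, where the column names and the 50% daily-move threshold are illustrative assumptions.

```python
# Sketch: flag implausible Yahoo! Finance rows before training.
# Assumes rows are sorted by date within each ticker; column names
# ("ticker", "adj_close", "volume") and thresholds are illustrative.
import pandas as pd

def flag_unrealistic(df: pd.DataFrame, max_daily_move: float = 0.5) -> pd.Series:
    bad = (df["adj_close"] <= 0) | (df["volume"] < 0)
    returns = df.groupby("ticker")["adj_close"].pct_change()
    return bad | (returns.abs() > max_daily_move)  # moves beyond +/-50% are suspect

# Usage: df_clean = df[~flag_unrealistic(df)]
```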
Pickle file denoting the intersection of valid Numerai Signals tickers and tickers available on stocknewsapi.com.
The file is a dictionary with one key ("tickers") which points to a list of stock tickers.
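Reading it back is a one-liner with the standard library; a minimal sketch, assuming the file is named tickers.pkl:

```python
# Sketch: read the ticker list from the pickle file.
# NOTE: the filename is an assumption -- use the file shipped with this dataset.
import pickle

with open("tickers.pkl", "rb") as f:
    data = pickle.load(f)

tickers = data["tickers"]  # single key "tickers" -> list of stock tickers
print(len(tickers), tickers[:5])
```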
Highlights

We have just released the biggest upgrade to Numerai’s dataset ever. The new dataset has 4x the number of rows, more than 3x the number of features, and 20 optional targets. The fastest way to get started with the new dataset is to run through the new example scripts.

You can continue to use the old dataset in the same way, but models on the new dataset have much higher scores in historical tests.

The website’s “Download Data” button will only download new data. The legacy data can still be downloaded via the API (GraphQL or NumerAPI).

The website’s “Upload Predictions” button will only work for predictions made on the new data. Submissions using the legacy data can still be made via the API.

New Data

The new data has both more features and more eras. There are now 1050 features instead of 310, and a total of 679 training and validation eras with targets provided, instead of 142.
The eras are now weekly instead of monthly. This means that eras match the tournament more precisely; however, they are now “overlapping”: nearby eras are correlated with one another because their targets are generated from stock market performance over a shared, or “overlapping”, period of time.
The new “training” period covers the same time period as eras #1-132 in the old data, but is now weekly rather than monthly.
The new “test” period is the same as the previous “test” period.
The new “validation” period covers the same time period as eras #197-212 in the old data plus an additional time period, and is now weekly rather than monthly.
The new “live” period functions just like the “live” period in the old data.
training_data
- One continuous period of historical data
- Has targets provided

tournament_data
- Consists of “test” and “live”
- All of these rows must be predicted for a submission to be valid
- No targets provided
- Test is used for internal testing, but is not part of the tournament scoring and payouts
- Live is what users stake on and are scored on in the tournament

validation_data
- A separate file; predictions on these rows are not required for submission
- It can be submitted at any time to receive diagnostics on your predictions
- Has targets provided
- This is the most recent data that we provide, far removed from the training data, which makes it particularly useful for seeing how your model’s performance declines over time, and how it would have been performing lately
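For the submission and diagnostics routes above, the NumerAPI Python client exposes upload endpoints; here is a hedged sketch, where the credentials, model id, and file names are placeholders you must supply, and upload_diagnostics assumes a reasonably recent numerapi version.

```python
# Sketch: submit live predictions, and separately upload validation predictions
# for diagnostics. Requires API keys with upload permission; all values below
# are placeholders.
from numerapi import NumerAPI

napi = NumerAPI(public_id="YOUR_PUBLIC_ID", secret_key="YOUR_SECRET_KEY")
MODEL_ID = "your-model-uuid"

napi.upload_predictions("tournament_predictions.csv", model_id=MODEL_ID)
napi.upload_diagnostics("validation_predictions.csv", model_id=MODEL_ID)
```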
New Targets

The final major change is that there are now many different targets in the dataset. The tournament target, which is the one you are scored on, is always called “target”. Currently “target” corresponds to “target_nomi_20”, but this may change in the future. You will also find 20 more targets which are not scored on, but which you may find useful for training. The 20 targets consist of 10 different types of targets constructed over 2 different time periods: 20 and 60 days. Additional targets may also be added in the future.
Be aware that some of the new targets have different binning distributions from what you see with Nomi, e.g. 7 bins rather than 5, with less rigid constraints on samples per bin. Training models to be good at multiple targets and/or ensembling models trained on different targets is a great way to improve generalization performance and increase the uniqueness of your model.
The new targets are regularized in different ways and exhibit correlations with each other ranging from roughly 0.3 to 0.9. Due to this regularization, you may find that models trained on some of the new targets generalize to predicting “target” better than models trained on “target” itself. Other targets may yield models that appear to generalize poorly to “target” but end up helping in an ensemble.
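A minimal sketch of that multi-target ensembling idea follows. The parquet filenames and the second target name are assumptions following the naming scheme above, and the LightGBM hyperparameters are illustrative rather than a recommendation.

```python
# Sketch: train one model per target, then ensemble by averaging ranked predictions.
# Filenames, target names, and hyperparameters are illustrative assumptions.
import pandas as pd
import lightgbm as lgb

train = pd.read_parquet("numerai_training_data.parquet")
valid = pd.read_parquet("numerai_validation_data.parquet")
features = [c for c in train.columns if c.startswith("feature")]

ranked_preds = []
for target in ["target_nomi_20", "target_jerome_20"]:
    rows = train.dropna(subset=[target])  # some auxiliary targets may have gaps
    model = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.01,
                              max_depth=5, colsample_bytree=0.1)
    model.fit(rows[features], rows[target])
    # Rank predictions so differently scaled targets combine fairly.
    ranked_preds.append(pd.Series(model.predict(valid[features])).rank(pct=True))

valid["prediction"] = pd.concat(ranked_preds, axis=1).mean(axis=1).to_numpy()
```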
You may also find that training on the 60-day targets, e.g. “target_nomi_60”, yields more stable models when scored on the 20-day “target”. But beware: the eras are even more overlapped when using 60-day targets! You need to sample every 4th era to get non-overlapping eras with the 20-day targets, but every 12th era to get non-overlapping eras with the 60-day targets. If you choose not to subsample in this way, you instead need to be very careful about purging overlapping eras from your cross-validation folds. With great power comes great responsibility!
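Here is a minimal sketch of that subsampling, assuming era labels are integer-like strings as in the new data files:

```python
# Sketch: keep every 4th era so 20-day targets no longer overlap;
# use step=12 for the 60-day targets.
import pandas as pd

def subsample_eras(df: pd.DataFrame, step: int = 4) -> pd.DataFrame:
    eras = sorted(df["era"].unique(), key=int)  # era labels sort as integers
    keep = set(eras[::step])
    return df[df["era"].isin(keep)]

# Usage: train_nonoverlap = subsample_eras(train, step=4)
```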
Finally, be careful about just selecting a target that does well on Validation. Target selection is yet another way to overfit. When in doubt, cross-validate!
API

The new data can be accessed either through the “Download Data” button in the leaderboard sidebar or through the S3 links returned by the dataset API using the filename argument; a list of valid filenames can be retrieved through the new list_datasets API query. The new training_data and validation_data files will be the same every week, while the tournament_data file will be updated with the latest live era.
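In code, that flow looks like the following sketch with the NumerAPI client (numerapi on PyPI); the filename passed to download_dataset should come from the list_datasets output.

```python
# Sketch: discover and download the new dataset files with NumerAPI.
from numerapi import NumerAPI

napi = NumerAPI()  # public data downloads need no API keys

print(napi.list_datasets())  # valid filenames for download_dataset

# Download one file by name; the name below is an example from the listing.
napi.download_dataset("numerai_training_data.parquet",
                      "numerai_training_data.parquet")
```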