Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
This comes directly from the README:
The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:
CustomerID,Rating,Date
Movie information in "movie_titles.txt" is in the following format:
MovieID,YearOfRelease,Title
The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.
MovieID1:
CustomerID11,Date11
CustomerID12,Date12
...
MovieID2:
CustomerID21,Date21
CustomerID22,Date22
For the Netflix Prize, your program must predict the all ratings the customers gave the movies in the qualifying dataset based on the information in the training dataset.
The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.
For example, if the qualifying dataset looked like:
111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07
then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0
which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of Decemeber, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of Decemeber, 2005, etc.
You must make predictions for all customers for all movies in the qualifying dataset.
To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.
MovieID1:
CustomerID11
CustomerID12
...
MovieID2:
CustomerID21
CustomerID22
Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.
If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt
The contest was originally hosted at http://netflixprize.com/index.html
The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar
This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank
This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.
For more information, see the World Bank website.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population
http://data.worldbank.org/data-catalog/ed-stats
https://cloud.google.com/bigquery/public-data/world-bank-education
Citation: The World Bank: Education Statistics
Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @till_indeman from Unplash.
Of total government spending, what percentage is spent on education?
Starbucks started as a roaster and retailer of whole bean and ground coffee, tea and spices with a single store in Seattleās Pike Place Market in 1971. The company now operates more than 24,000 retail stores in 70 countries.
This dataset includes a record for every Starbucks or subsidiary store location currently in operation as of February 2017.
This data was scraped from the Starbucks store locator webpage by Github user chrismeller.
What city or country has the highest number of Starbucks stores per capita? What two Starbucks locations are the closest in proximity to one another? What location on Earth is farthest from a Starbucks? How has Starbucks expanded overseas?
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Type and address of emergency incident to which OEM responded
This is a dataset hosted by the City of New York. The city has an open data platform found here and they update their information according the amount of data that is brought in. Explore New York City using Kaggle and all of the data sources available through the City of New York organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Each file contains klines for 1 month period with 1 minute intervals. File name formating looks like mm-yyyy-SMB1SMB2 (e.g. 11-2017-XRPBTC).
This data set contains now only XRP/BTC and ETH/USDT symbol pair now, but it will be expand soon.
This dataset was collected from Binance Exchange | Worlds Largest Crypto Exchange
This data set could inspire you on most efficient trading algorithms.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Provides data from the IBRD Statement of Income for the fiscal years ended June 30, 2013, June 30, 2012 and June 30, 2011. The values are expressed in millions of U.S. Dollars. Where applicable, changes have been made to certain line items on FY 2012 income statement to conform with the current year's presentation, but the comparable prior years' data sets have not been adjusted to reflect the reclassification impact of those changes.
This is a dataset hosted by the World Bank. The organization has an open data platform found here and they update their information according the amount of data that is brought in. Explore World Bank's Financial Data using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
This dataset is distributed under a Creative Commons Attribution 3.0 IGO license.
Cover photo by Matt Artz on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
This dataset is distributed under Creative Commons Attribution 3.0 IGO
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
This comes directly from the README:
The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:
CustomerID,Rating,Date
Movie information in "movie_titles.txt" is in the following format:
MovieID,YearOfRelease,Title
The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.
MovieID1:
CustomerID11,Date11
CustomerID12,Date12
...
MovieID2:
CustomerID21,Date21
CustomerID22,Date22
For the Netflix Prize, your program must predict the all ratings the customers gave the movies in the qualifying dataset based on the information in the training dataset.
The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.
For example, if the qualifying dataset looked like:
111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07
then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0
which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of Decemeber, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of Decemeber, 2005, etc.
You must make predictions for all customers for all movies in the qualifying dataset.
To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.
MovieID1:
CustomerID11
CustomerID12
...
MovieID2:
CustomerID21
CustomerID22
Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.
If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt
The contest was originally hosted at http://netflixprize.com/index.html
The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar
This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here