6 datasets found
  1. Netflix Prize data

    • kaggle.com
    zip
    Updated Jul 19, 2017
    Cite
    Netflix (2017). Netflix Prize data [Dataset]. https://www.kaggle.com/netflix-inc/netflix-prize-data
    Download format: zip (0 bytes)
    Dataset updated: Jul 19, 2017
    Dataset authored and provided by: Netflix (http://netflix.com/)
    Description

    Context

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Content

    This comes directly from the README:

    TRAINING DATASET FILE DESCRIPTION

    The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

    CustomerID,Rating,Date

    • MovieIDs range from 1 to 17770 sequentially.
    • CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
    • Ratings are on a five star (integral) scale from 1 to 5.
    • Dates have the format YYYY-MM-DD.
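
    The layout described above can be read with a small streaming parser. The following is a minimal sketch, not the competition's own tooling: the names `Rating` and `parse_ratings` are hypothetical, and the sample lines are invented for illustration. It assumes each file (or a combined file) is a sequence of "MovieID:" header lines followed by "CustomerID,Rating,Date" lines.

```python
from datetime import date
from typing import Iterable, Iterator, NamedTuple


class Rating(NamedTuple):
    movie_id: int
    customer_id: int
    rating: int
    date: date


def parse_ratings(lines: Iterable[str]) -> Iterator[Rating]:
    """Yield one Rating per data line, tracking the current 'MovieID:' header."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.endswith(":"):
            movie_id = int(line[:-1])  # header line such as "111:"
            continue
        if movie_id is None:
            raise ValueError("rating line seen before any 'MovieID:' header")
        customer, stars, day = line.split(",")
        yield Rating(movie_id, int(customer), int(stars), date.fromisoformat(day))


# Invented sample lines in the documented format (not real dataset rows).
sample = ["111:", "3245,3,2005-12-19", "5666,4,2005-12-23", "225:", "1234,1,2005-05-26"]
ratings = list(parse_ratings(sample))
```

    Because the function is a generator, it can be fed an open file handle directly, so a large file never needs to be held in memory at once.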

    MOVIES FILE DESCRIPTION

    Movie information in "movie_titles.txt" is in the following format:

    MovieID,YearOfRelease,Title

    • MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
    • YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
    • Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.
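
    One practical detail of this format is that a title may itself contain commas, so a line should be split at most twice. A minimal sketch follows; `Movie` and `parse_movie_line` are hypothetical names, the sample lines are invented, and treating a non-numeric year as missing is an assumption that the README does not state.

```python
from typing import NamedTuple, Optional


class Movie(NamedTuple):
    movie_id: int
    year: Optional[int]
    title: str


def parse_movie_line(line: str) -> Movie:
    """Parse one 'MovieID,YearOfRelease,Title' line."""
    # Split on the first two commas only, so commas inside the title survive.
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Assumption: a non-numeric year field means the year is unknown.
    return Movie(int(movie_id), int(year) if year.isdigit() else None, title)


# Invented sample line (not a real dataset row).
movie = parse_movie_line("42,1999,Example Title, The\n")
```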

    QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

    The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

    MovieID1:
    CustomerID11,Date11
    CustomerID12,Date12
    ...
    MovieID2:
    CustomerID21,Date21
    CustomerID22,Date22

    For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset based on the information in the training dataset.

    The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

    For example, if the qualifying dataset looked like:

    111:
    3245,2005-12-19
    5666,2005-12-23
    6789,2005-03-14
    225:
    1234,2005-05-26
    3456,2005-11-07

    then a prediction file should look something like:

    111:
    3.0
    3.4
    4.0
    225:
    1.0
    2.0

    which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

    You must make predictions for all customers for all movies in the qualifying dataset.
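
    The mapping from qualifying file to prediction file can be sketched as a line-by-line transformation. This is an illustrative sketch only: `format_predictions` and the `predict` callback are hypothetical names, the one-decimal output format is an assumption taken from the example above, and the lookup table simply replays the example's ratings in place of a real model.

```python
from typing import Callable, Iterable, Iterator


def format_predictions(
    qualifying_lines: Iterable[str],
    predict: Callable[[int, int, str], float],
) -> Iterator[str]:
    """Copy 'MovieID:' headers through; replace each 'CustomerID,Date' line with a rating."""
    movie_id = None
    for raw in qualifying_lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            movie_id = int(line[:-1])
            yield line  # header lines are kept unchanged
            continue
        if movie_id is None:
            raise ValueError("customer line seen before any 'MovieID:' header")
        customer, day = line.split(",")
        yield f"{predict(movie_id, int(customer), day):.1f}"


qualifying = [
    "111:", "3245,2005-12-19", "5666,2005-12-23", "6789,2005-03-14",
    "225:", "1234,2005-05-26", "3456,2005-11-07",
]
# Stand-in for a trained model: the ratings from the README example.
example_ratings = {
    (111, 3245): 3.0, (111, 5666): 3.4, (111, 6789): 4.0,
    (225, 1234): 1.0, (225, 3456): 2.0,
}
output = list(format_predictions(qualifying, lambda m, c, d: example_ratings[(m, c)]))
```

    Keeping the output in the same order as the input matters here, since a prediction line no longer carries the customer id that identifies it.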

    THE PROBE DATASET FILE DESCRIPTION

    To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

    MovieID1:
    CustomerID11
    CustomerID12
    ...
    MovieID2:
    CustomerID21
    CustomerID22

    Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

    If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
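
    RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. A minimal sketch of the calculation follows; the function name `rmse` and the sample numbers are illustrative and are not taken from the probe set.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


# Invented predictions and true ratings, for illustration only.
score = rmse([3.0, 3.4, 4.0], [3, 4, 4])
```

    For the probe set, the actual ratings would be looked up in the training data for each (movie, customer) pair listed in "probe.txt".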

    Acknowledgements

    The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt

    The contest was originally hosted at http://netflixprize.com/index.html

    The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar

    Inspiration

    This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here

  2. World Bank: Education Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    World Bank (2019). World Bank: Education Data [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-bank-intl-education
    Download format: zip (0 bytes)
    Dataset updated: Mar 20, 2019
    Dataset authored and provided by: World Bank (http://worldbank.org/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank

    Content

    This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.

    For more information, see the World Bank website.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population

    http://data.worldbank.org/data-catalog/ed-stats

    https://cloud.google.com/bigquery/public-data/world-bank-education

    Citation: The World Bank: Education Statistics

    Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @till_indeman from Unsplash.

    Inspiration

    Of total government spending, what percentage is spent on education?

  3. Starbucks Locations Worldwide

    • kaggle.com
    zip
    Updated Feb 13, 2017
    Cite
    Starbucks (2017). Starbucks Locations Worldwide [Dataset]. https://www.kaggle.com/starbucks/store-locations
    Download format: zip (1149144 bytes)
    Dataset updated: Feb 13, 2017
    Dataset authored and provided by: Starbucks (http://starbucks.com/)
    Description

    Context

    Starbucks started as a roaster and retailer of whole bean and ground coffee, tea and spices with a single store in Seattle’s Pike Place Market in 1971. The company now operates more than 24,000 retail stores in 70 countries.

    Content

    This dataset includes a record for every Starbucks or subsidiary store location currently in operation as of February 2017.

    Acknowledgements

    This data was scraped from the Starbucks store locator webpage by Github user chrismeller.

    Inspiration

    What city or country has the highest number of Starbucks stores per capita? What two Starbucks locations are the closest in proximity to one another? What location on Earth is farthest from a Starbucks? How has Starbucks expanded overseas?

  4. NY Emergency Response Incidents

    • kaggle.com
    Updated Dec 2, 2019
    Cite
    City of New York (2019). NY Emergency Response Incidents [Dataset]. https://www.kaggle.com/new-york-city/ny-emergency-response-incidents/discussion
    Download format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Dec 2, 2019
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: City of New York
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York, New York
    Description

    Content

    Type and address of emergency incident to which OEM responded

    Context

    This is a dataset hosted by the City of New York. The city has an open data platform found here and they update their information according to the amount of data that is brought in. Explore New York City using Kaggle and all of the data sources available through the City of New York organization page!

    • Update Frequency: This dataset is updated monthly.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

  5. Binance Crypto Klines

    • kaggle.com
    zip
    Updated Apr 8, 2018
    Cite
    Binance (2018). Binance Crypto Klines [Dataset]. https://www.kaggle.com/binance/binance-crypto-klines
    Download format: zip (1033121370 bytes)
    Dataset updated: Apr 8, 2018
    Dataset authored and provided by: Binance (http://binance.com/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Each file contains klines for a one-month period at one-minute intervals. File names follow the pattern mm-yyyy-SMB1SMB2 (e.g. 11-2017-XRPBTC).

    This data set currently contains only the XRP/BTC and ETH/USDT symbol pairs, but it will be expanded soon.

    Features

    • Open time -> timestamp (milliseconds)
    • Open price -> float
    • High price -> float
    • Low price -> float
    • Close price -> float
    • Volume -> float
    • Quote asset volume -> float
    • Close time -> timestamp (milliseconds)
    • Number of trades -> int
    • Taker buy base asset volume -> float
    • Taker buy quote asset volume -> float
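
    The file-name pattern and the field list above can be turned into typed records along the following lines. This is a sketch under stated assumptions: the on-disk layout is not described here, so it assumes each row supplies the eleven fields as strings in the listed order, and that the file name is given without an extension. `Kline`, `parse_kline`, and `parse_file_name` are hypothetical names, and the sample row is invented.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Sequence, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class Kline(NamedTuple):
    open_time: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float
    quote_asset_volume: float
    close_time: datetime
    trades: int
    taker_buy_base_volume: float
    taker_buy_quote_volume: float


def ms_to_utc(ms: str) -> datetime:
    """Convert a millisecond timestamp string to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=int(ms))


def parse_kline(fields: Sequence[str]) -> Kline:
    """Build a Kline from eleven string fields in the documented order."""
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    f = fields
    return Kline(
        ms_to_utc(f[0]), float(f[1]), float(f[2]), float(f[3]), float(f[4]),
        float(f[5]), float(f[6]), ms_to_utc(f[7]), int(f[8]),
        float(f[9]), float(f[10]),
    )


def parse_file_name(name: str) -> Tuple[int, int, str]:
    """Split 'mm-yyyy-SMB1SMB2' into (month, year, symbol pair)."""
    month, year, pair = name.split("-", 2)
    return int(month), int(year), pair


# Invented row, for illustration only.
row = "1509494400000,0.1,0.2,0.05,0.15,1000.0,150.0,1509494459999,12,400.0,60.0"
kline = parse_kline(row.split(","))
name_parts = parse_file_name("11-2017-XRPBTC")
```

    The symbol pair is kept as one string because the pattern concatenates the two symbols without a separator, so splitting it would require a list of known symbols.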

    Acknowledgements

    This dataset was collected from Binance Exchange | Worlds Largest Crypto Exchange

    Inspiration

    This data set could inspire more efficient trading algorithms.

  6. IBRD Statement Of Income FY2013

    • kaggle.com
    zip
    Updated Apr 9, 2019
    + more versions
    Cite
    World Bank (2019). IBRD Statement Of Income FY2013 [Dataset]. https://www.kaggle.com/theworldbank/ibrd-statement-of-income-fy2013
    Download format: zip (3239 bytes)
    Dataset updated: Apr 9, 2019
    Dataset authored and provided by: World Bank (http://worldbank.org/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    Provides data from the IBRD Statement of Income for the fiscal years ended June 30, 2013, June 30, 2012 and June 30, 2011. The values are expressed in millions of U.S. Dollars. Where applicable, changes have been made to certain line items on the FY 2012 income statement to conform with the current year's presentation, but the comparable prior years' data sets have not been adjusted to reflect the reclassification impact of those changes.

    Context

    This is a dataset hosted by the World Bank. The organization has an open data platform found here and they update their information according to the amount of data that is brought in. Explore World Bank's Financial Data using Kaggle and all of the data sources available through the World Bank organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    This dataset is distributed under a Creative Commons Attribution 3.0 IGO license.

    Cover photo by Matt Artz on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.



Netflix Prize data

Dataset from Netflix's competition to improve their recommendation algorithm
