5 datasets found
  1. covid19_week4_mlandry

    • kaggle.com
    zip
    Updated Apr 15, 2020
    Cite
    mlandry (2020). covid19_week4_mlandry [Dataset]. https://www.kaggle.com/mlandry/covid19-week4-mlandry
    Explore at:
    zip (441557 bytes)
    Dataset updated
    Apr 15, 2020
    Authors
    mlandry
    Description
  2. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (167219625372 bytes)
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle, which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
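
    The join described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Id` and `TotalVotes` column names, the sample rows, and the `.py` extension are placeholders rather than details confirmed by this listing, and the helper names are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def kernel_version_id(path: str) -> int:
    """Return the KernelVersions id encoded in a code file's name."""
    return int(PurePosixPath(path).stem)


def join_with_kernel_versions(paths, kernel_versions_csv):
    """Pair each code file with its KernelVersions row (None if no row matches)."""
    # Assumption: the id column of the KernelVersions csv is called "Id".
    rows = {int(row["Id"]): row for row in csv.DictReader(kernel_versions_csv)}
    return {path: rows.get(kernel_version_id(path)) for path in paths}


# Tiny made-up stand-in for the KernelVersions csv file.
sample_csv = io.StringIO("Id,TotalVotes\n123456789,7\n123456790,0\n")
joined = join_with_kernel_versions(
    ["123/456/123456789.ipynb", "123/456/123456999.py"], sample_csv
)
```

    Files whose id has no matching row map to `None`, which makes gaps between the two datasets easy to spot.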

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; for example, folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; for example, 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have far fewer than 1 thousand files because private and interactive sessions are excluded.
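
    The folder for a given version id follows from integer division. The sketch below uses a hypothetical helper, `version_folder`, and assumes folder names are not zero-padded, which the description above does not state.

```python
def version_folder(version_id: int) -> str:
    """Return the 'top/sub' folder expected to hold a kernel version id."""
    if version_id < 0:
        raise ValueError("version_id must be non-negative")
    top = version_id // 1_000_000        # 1 million versions per top-level folder
    sub = (version_id // 1_000) % 1_000  # 1 thousand versions per subfolder
    # Assumption: folder names are plain integers without zero padding.
    return f"{top}/{sub}"
```

    For example, `version_folder(123456789)` yields `123/456`, consistent with the ranges given above.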

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  3. City _of_LA_Readbility_Scores

    • kaggle.com
    zip
    Updated May 11, 2019
    Cite
    Dawn Moyer (2019). City _of_LA_Readbility_Scores [Dataset]. https://www.kaggle.com/silverfoxdss/city-of-la-readbility-scores
    Explore at:
    zip (1475788 bytes)
    Dataset updated
    May 11, 2019
    Authors
    Dawn Moyer
    Area covered
    Los Angeles
    Description

    Context

    This dataset was created for https://www.kaggle.com/c/data-science-for-good-city-of-los-angeles.

    The input text and PDF files were uploaded to Readable.com during the free 24-hour demo period. Given the time constraint, I uploaded and processed as many of the files as I could. I also had to export the resulting output files. I merged the results into this single .csv file.

    Usage can be seen in my kernel: https://www.kaggle.com/silverfoxdss/city-of-la-readability-recommendations

    Content

    Details on the columns and calculations are found on the website: https://readable.help/readability-scoring-algorithms-and-functionality

    Acknowledgements

    Huge shout-out to Readable.com!

  4. Dysonian SETI Candidates DR1

    • kaggle.com
    zip
    Updated Sep 15, 2018
    Cite
    José H. Solórzano (2018). Dysonian SETI Candidates DR1 [Dataset]. https://www.kaggle.com/solorzano/dysonian-seti-candidates-dr1
    Explore at:
    zip (13023390 bytes)
    Dataset updated
    Sep 15, 2018
    Authors
    José H. Solórzano
    Description

    Context

    Lists of Gaia DR2 stars that are anomalously dim/bright and whose neighbors have an unusual brightness distribution were produced in two Kaggle kernels. Candidates should be studied further to see if they are unusual in other ways. Their century-long behavior could be key in understanding why they are anomalous.

    "Dysonian SETI" refers to the search for extraterrestrial astroengineering that noticeably alters a star's brightness.

    Content

    The dataset includes CSV lists of Dysonian SETI candidates and ordinary controls. Additionally, century-long photometric observations were obtained from the DASCH archive and put in the dasch directory of the dataset.

    DASCH was searched using its default parameters (APASS input catalog, N >= 1, d <= 5). Star IDs were either of the form "Gaia DR2 ..." or "TYC ..." (a Tycho2 ID). A dasch_id column was added to candidate/control CSV files, containing IDs provided by DASCH in the "Catalog Query Results" table. If DASCH gives an error with text "Data for this region is not available" then a value of "None" will be found in the corresponding dasch_id cell.
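
    When working with the candidate/control CSV files, it may help to separate stars that have a DASCH match from those marked "None". The following is a minimal sketch: only the `dasch_id` column name comes from the description above, while the `star_id` column and the sample values are made up for illustration.

```python
import csv
import io


def split_by_dasch_match(csv_file):
    """Split rows into (matched, unmatched) based on the dasch_id column."""
    matched, unmatched = [], []
    for row in csv.DictReader(csv_file):
        # "None" marks stars for which DASCH reported no data for the region.
        if row["dasch_id"] == "None":
            unmatched.append(row)
        else:
            matched.append(row)
    return matched, unmatched


# Made-up rows; real files have their own columns and identifiers.
sample = io.StringIO(
    "star_id,dasch_id\n"
    "Gaia DR2 1,EXAMPLE_A\n"
    "TYC 2,None\n"
    "Gaia DR2 3,EXAMPLE_B\n"
)
matched, unmatched = split_by_dasch_match(sample)
```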

    The "short" forms of tabular data files (.txt extension) were downloaded from the "DASCH Photometry Data for Catalog Query" page. File names start with "short_" followed by the star ID used in search, and contain the dasch_id value as well.

    In cases where the "Catalog Query Results" table gives multiple matches, the one with more data points was picked.

    Note about errors: If DASCH gives an "Unrecognized Identifier" error message, it could be a temporary glitch. As explained in the search page: "WARNING: The Simbad name resolution service has been intermittently unreliable from this web site. If you experience problems, please retrieve the coordinates directly from the Simbad reflector site."

    CSV files contained in the dataset are:

    Acknowledgments

    This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

    References

    Tang, S., Grindlay, J., Los, E., et al. (2013). Improved Photometry for the DASCH Pipeline. PASP, 125(929):857–865.

  5. Yahoo NSFW as MobileNetV2 Bottlenecks

    • kaggle.com
    zip
    Updated Oct 3, 2019
    Cite
    Nathan Murray (2019). Yahoo NSFW as MobileNetV2 Bottlenecks [Dataset]. https://www.kaggle.com/datasets/nmurray1234/yahoo-nsfw-as-mobilenetv2-bottlenecks
    Explore at:
    zip (5554040674 bytes)
    Dataset updated
    Oct 3, 2019
    Authors
    Nathan Murray
    Description

    Context

    This dataset is meant to aid development of effective and computationally light NSFW filtering that can be run on low-powered devices. To understand why I'm posting this dataset, see this article.

    NSFW machine learning requires NSFW images, which are best not distributed on public sites (and usually against Terms of Service). Instead, this dataset contains the model outputs obtained by passing 200K mostly pornographic images through the first layers of MobileNetV2. Additionally, the outputs of the Yahoo NSFW model are included.

    Transfer learning principles can then be applied to this dataset. Using the MobileNetV2 outputs as bottlenecks, and the Yahoo NSFW outputs as target values, one can build a model which tries to mimic the Yahoo NSFW model.

    Content

    The files are in several archives (it was the only way to upload this much data with a 2GB limit per file). Inside the archives are npz files (compressed numpy arrays), containing 2000 input and target tensors.

    Keras was used to create the MobileNetV2 output, and you can see in the tutorial kernel how it can be utilized.
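
    As a rough illustration of the transfer learning idea, the sketch below loads one npz batch with NumPy and fits a small ridge-regression head that maps bottlenecks to target scores. It is a deliberately simple stand-in and not the Keras approach from the tutorial kernel. The array keys `inputs` and `targets`, the tensor shapes, and the synthetic data are all assumptions, since the listing does not specify them.

```python
import io

import numpy as np


def load_batch(npz_file, input_key="inputs", target_key="targets"):
    """Load one batch; the key names are assumptions, not the real ones."""
    with np.load(npz_file) as data:
        return data[input_key], data[target_key]


def add_bias(x):
    """Flatten each sample and append a constant bias column."""
    flat = x.reshape(len(x), -1)
    return np.hstack([flat, np.ones((len(flat), 1))])


def fit_linear_head(x, y, l2=1e-3):
    """Ridge regression from bottleneck features to target scores."""
    design = add_bias(x)
    gram = design.T @ design + l2 * np.eye(design.shape[1])
    return np.linalg.solve(gram, design.T @ y)


def predict(weights, x):
    """Apply the fitted linear head to new bottleneck features."""
    return add_bias(x) @ weights


# Synthetic stand-in for one archive member (real shapes will differ).
rng = np.random.default_rng(0)
fake_inputs = rng.normal(size=(200, 8)).astype(np.float32)
fake_targets = fake_inputs @ rng.normal(size=8)
buffer = io.BytesIO()
np.savez_compressed(buffer, inputs=fake_inputs, targets=fake_targets)
buffer.seek(0)

x, y = load_batch(buffer)
weights = fit_linear_head(x, y)
```

    On real data, a small neural network on top of the bottlenecks would likely fit better than a linear head; the linear version is used here only to keep the example short.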
