Final blends combining the output file from this kernel: https://www.kaggle.com/david1013/beluga-covid-19-w3-a-few-charts-and-submission (v2)
with my own predictions: notebook: https://www.kaggle.com/mlandry/kernel2b47680e15; output: https://www.kaggle.com/mlandry/covid19-week3-predictions-mlandry
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle, which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between the Kaggle and Stack Overflow communities, and more.
The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
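As a hedged illustration, such a join might look like this in pandas; the column names (Id, ScriptId, Title, TotalVotes) are assumptions based on Meta Kaggle's Kernels.csv and KernelVersions.csv files and should be verified against the current schema:

```python
import pandas as pd

# Meta Kaggle tables (column names assumed; verify against the schema).
kernels = pd.read_csv("Kernels.csv")          # one row per kernel
versions = pd.read_csv("KernelVersions.csv")  # one row per version/file

# Attach each code file's parent-kernel metadata, e.g. its vote count.
enriched = versions.merge(
    kernels,
    left_on="ScriptId",  # assumed foreign key from version to kernel
    right_on="Id",
    suffixes=("_version", "_kernel"),
)
print(enriched[["Id_version", "Title", "TotalVotes"]].head())
```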
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
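A minimal sketch of that id-to-path mapping, assuming the folder names are not zero-padded and picking an arbitrary file extension (the real extension varies with the notebook's language):

```python
def kernel_version_path(version_id: int, extension: str = "py") -> str:
    """Map a KernelVersions id to its location in the two-level layout."""
    top = version_id // 1_000_000        # folder 123 for id 123,456,789
    sub = (version_id // 1_000) % 1_000  # subfolder 456 for id 123,456,789
    return f"{top}/{sub}/{version_id}.{extension}"

print(kernel_version_path(123_456_789))  # -> 123/456/123456789.py
```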
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
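For reference, a minimal requester-pays download sketch using the official google-cloud-storage Python client; "my-billing-project" and the object path are placeholders:

```python
from google.cloud import storage

client = storage.Client()
# user_project is the GCP project billed for the requester-pays egress.
bucket = client.bucket(
    "kaggle-meta-kaggle-code-downloads",
    user_project="my-billing-project",
)
# Hypothetical object path -- list the bucket to see the real layout.
blob = bucket.blob("123/456/123456789.ipynb")
blob.download_to_filename("123456789.ipynb")
```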
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
This dataset was created for https://www.kaggle.com/c/data-science-for-good-city-of-los-angeles.
The input text and PDF files were uploaded to Readable.com during its free 24-hour demo period. Given the time constraint, I uploaded and processed as many of the files as I could, then exported the resulting output files and merged the results into this single .csv file.
Usage can be seen in my kernel: https://www.kaggle.com/silverfoxdss/city-of-la-readability-recommendations
Details on the columns and calculations are found on the website: https://readable.help/readability-scoring-algorithms-and-functionality
Huge shout-out to Readable.com!
Lists of Gaia DR2 stars that are anomalously dim/bright and whose neighbors have an unusual brightness distribution were produced in two Kaggle kernels. Candidates should be studied further to see if they are unusual in other ways. Their century-long behavior could be key in understanding why they are anomalous.
"Dysonian SETI" refers to the search for extraterrestrial astroengineering that noticeably alters a star's brightness.
The dataset includes CSV lists of Dysonian SETI candidates and ordinary controls. Additionally, century-long photometric observations were obtained from the DASCH archive and put in the dasch directory of the dataset.
DASCH was searched using its default parameters (APASS input catalog, N >= 1, d <= 5). Star IDs were either of the form "Gaia DR2 ..." or "TYC ..." (a Tycho2 ID). A dasch_id column was added to the candidate/control CSV files, containing the IDs provided by DASCH in the "Catalog Query Results" table. If DASCH returns an error with the text "Data for this region is not available", the corresponding dasch_id cell holds the value "None".
The "short" forms of tabular data files (.txt extension) were downloaded from the "DASCH Photometry Data for Catalog Query" page. File names start with "short_" followed by the star ID used in search, and contain the dasch_id value as well.
In cases where the "Catalog Query Results" table gives multiple matches, the one with more data points was picked.
Note about errors: if DASCH gives an "Unrecognized Identifier" error message, it could be a temporary glitch. As explained on the search page: "WARNING: The Simbad name resolution service has been intermittently unreliable from this web site. If you experience problems, please retrieve the coordinates directly from the Simbad reflector site."
CSV files contained in the dataset are:
clustered-dim-candidates.csv: The list of candidates from the kernel Multi-Stellar SETI Candidate Selection Part 2 (https://www.kaggle.com/solorzano/multi-stellar-seti-candidate-selection-part-2). Corresponding DASCH files were searched using Tycho2 IDs the week of 9/3/2018.
clustered-bright-candidates.csv: The list of candidates from the kernel Multi-Stellar SETI Candidate Selection Part 3 (https://www.kaggle.com/solorzano/multi-stellar-seti-candidate-selection-part-3). Corresponding DASCH files were searched using Tycho2 IDs the week of 9/3/2018.
naive-candidates.csv: A list of anomalously dim candidates obtained via a simple 3-sigma threshold method in Version 3 of an early Dysonian SETI kernel (https://www.kaggle.com/solorzano/dysonian-seti-with-machine-learning/output?scriptVersionId=4507617). Corresponding DASCH files were searched using Gaia DR2 IDs the week of 7/9/2018.
normal-controls.csv: A list of ordinary stars obtained via random sampling in Version 3 of the same kernel (https://www.kaggle.com/solorzano/dysonian-seti-with-machine-learning/output?scriptVersionId=4507617). Corresponding DASCH files were searched using Gaia DR2 IDs the week of 7/16/2018.
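As a hedged sketch, pairing a candidate list with its DASCH photometry files might look like this in pandas; the tycho2_id column name is hypothetical (check the CSV for the actual star-ID column), while the dasch_id and "short_" conventions follow the description above:

```python
import os
import pandas as pd

candidates = pd.read_csv("clustered-dim-candidates.csv")

# Skip stars where DASCH reported "Data for this region is not available".
candidates = candidates[candidates["dasch_id"] != "None"]

for _, row in candidates.iterrows():
    # This list was searched by Tycho2 ID, so file names should look like
    # "short_TYC <id>.txt"; tycho2_id is a hypothetical column name.
    path = os.path.join("dasch", f"short_{row['tycho2_id']}.txt")
    if os.path.exists(path):
        print(path, "->", row["dasch_id"])
```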
This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.
Tang, S., Grindlay, J., Los, E., et al. (2013). Improved Photometry for the DASCH Pipeline. PASP, 125(929):857–865.
This dataset is meant to aid development of effective, computationally light NSFW filtering that can run on low-powered devices. To understand why I'm posting this dataset, see this article.
NSFW machine learning requires NSFW images, which are best not distributed on public sites (and doing so is usually against Terms of Service). Instead, this dataset contains the outputs produced by sending 200K mostly pornographic images through the first layers of MobileNetV2. The outputs of the Yahoo NSFW model are also included.
Transfer learning principles can then be applied to this dataset: using the MobileNetV2 outputs as bottlenecks and the Yahoo NSFW outputs as target values, one can build a model that tries to mimic the Yahoo NSFW model.
The files are split across several archives (the only way to upload this much data given the 2 GB per-file limit). Inside the archives are .npz files (compressed NumPy arrays) containing 2000 input and target tensors.
Keras was used to create the MobileNetV2 outputs, and the tutorial kernel shows how they can be used.
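A minimal sketch of that transfer-learning setup, assuming each .npz archive exposes hypothetical "inputs" (MobileNetV2 bottlenecks) and "targets" (Yahoo NSFW scores) arrays; check the archives for the real key names and shapes:

```python
import glob
import numpy as np
import tensorflow as tf

# Gather bottleneck/target pairs from the .npz archives.
xs, ys = [], []
for path in sorted(glob.glob("*.npz")):
    with np.load(path) as batch:
        xs.append(batch["inputs"])   # hypothetical key name
        ys.append(batch["targets"])  # hypothetical key name
X, y = np.concatenate(xs), np.concatenate(ys)

# Small head that learns to mimic the Yahoo NSFW score from bottlenecks.
model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=X.shape[1:]))
if X.ndim == 4:  # spatial feature maps need pooling first
    model.add(tf.keras.layers.GlobalAveragePooling2D())
model.add(tf.keras.layers.Dense(128, activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # NSFW score
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.1)
```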