MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The Vault is a multilingual code-text dataset with over 40 million pairs covering 10 popular programming languages. It is the largest corpus containing parallel code-text data. By building upon The Stack, a massive raw code sample collection, the Vault offers a comprehensive and clean resource for advancing research in code understanding and generation. It provides a high-quality dataset that includes code-text pairs at multiple levels, such as class and inline-level, in addition to the function level. The Vault can serve many purposes at multiple levels.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The Vault is a multilingual code-text dataset with over 40 million pairs covering 10 popular programming languages. It is the largest corpus containing parallel code-text data. By building upon The Stack, a massive raw code sample collection, the Vault offers a comprehensive and clean resource for advancing research in code understanding and generation. It provides a high-quality dataset that includes code-text pairs at multiple levels, such as class and inline-level, in addition to the function level. The Vault can serve many purposes at multiple levels.