axay/javascript-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains an anonymized list of surveyed developers who provided their expertise level on three popular JavaScript libraries:
ReactJS, a library for building enriched web interfaces
MongoDB, a driver for accessing MongoDB databases
Socket.IO, a library for realtime communication
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/javascript in Semeru
CodeXGLUE -- Code-To-Text
Task Definition
The task is to generate natural language comments for a code snippet; results are evaluated by smoothed BLEU-4 score.
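The exact smoothing variant used by the CodeXGLUE evaluator may differ; the sketch below implements one common add-one smoothed BLEU-4 in plain Python, as an illustration of how such a score is computed.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu4(reference, candidate):
    """Add-one smoothed BLEU-4: geometric mean of smoothed 1..4-gram
    precisions, multiplied by the usual brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    log_prec = 0.0
    for n in range(1, 5):
        ref_counts = ngrams(ref, n)
        cand_counts = ngrams(cand, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # add-one smoothing keeps the score nonzero when an n-gram order has no match
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec / 4)
```

A perfect match scores 1.0; partial overlaps score strictly between 0 and 1 thanks to the smoothing.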
Dataset
The dataset we use comes from CodeSearchNet and we filter the dataset as the following:… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-javascript.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a sampled dataset collected by JSObserver on Alexa top 100K websites. We analyze the log files to identify JavaScript global identifier conflicts, i.e., variable value conflicts, variable type conflicts and function definition conflicts.
We release the log files on websites where we detect the above conflicts, and split the whole dataset into 10 subsets, i.e., 1-50K-0.zip ~ 50K-100K-4.zip.
The writes to a memory location in JavaScript are saved in [rank].[main/sub].[frame_cnt].asg (e.g., 1.main.0.asg) files.
JavaScript global function definitions are saved in [rank].[main/sub].[frame_cnt].func (e.g., 1.main.0.func) files.
The maps from script IDs to script URLs are saved in [rank].[main/sub].[frame_cnt].id2url (e.g., 1.main.0.id2url) files.
The source code of scripts are saved in [rank].[main/sub].[frame_cnt].[script_ID].script (e.g., 1.main.0.17.script) files.
We also sample 100 websites on which we did not detect any conflicts. The log files collected on those websites are available in sampled_no_conflict.zip
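The naming scheme above ([rank].[main/sub].[frame_cnt].[extension], with an extra script ID for .script files) is regular enough to parse mechanically. A minimal sketch of such a parser (the helper name and returned fields are my own, not part of the dataset):

```python
import re

# Parse JSObserver log filenames such as "1.main.0.asg" or "1.main.0.17.script"
# into their components: site rank, main/sub frame, frame counter, optional
# script ID, and file kind.
LOG_NAME = re.compile(
    r"^(?P<rank>\d+)\.(?P<frame>main|sub)\.(?P<frame_cnt>\d+)"
    r"(?:\.(?P<script_id>\d+))?\.(?P<kind>asg|func|id2url|script)$"
)

def parse_log_name(name):
    m = LOG_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized log file name: {name}")
    d = m.groupdict()
    # Convert purely numeric fields to int, leave the rest as strings/None.
    return {k: (int(v) if v is not None and v.isdigit() else v)
            for k, v in d.items()}
```

For example, parse_log_name("1.main.0.17.script") yields rank 1, frame "main", frame counter 0, script ID 17, and kind "script".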
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of multiple files which contain bug prediction training data.
The entries in the dataset are JavaScript functions that are either buggy or non-buggy. Bug-related information was obtained from the project ESLint contained in BugsJS (https://github.com/BugsJS/eslint). The buggy instances were collected throughout the lifetime of the project; however, we added non-buggy entries from the latest version, which is tagged as fix (entries that were previously included as buggy were not included as non-buggy later on).
The dataset is based on hybrid call graphs constructed by https://github.com/sed-szeged/hcg-js-framework. The result of this tool is a call graph in which each edge is associated with a confidence level indicating how likely it is that the given edge is a valid call edge.
We used different threshold values above which we considered the edges to be valid. The following threshold values were used:
0.00
0.05
0.20
0.30
The prefix in the dataset file names comes from the threshold used. The datasets include the coupling metrics NII (Number of Incoming Invocations) and NOI (Number of Outgoing Invocations), which were calculated by a static source code analyzer called SourceMeter. The hybrid counterparts of these metrics (HNII and HNOI) are based on the given threshold values.
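Deriving a dataset variant from a threshold amounts to discarding call-graph edges whose confidence falls below it. A minimal sketch of that filtering step, under my own assumed edge representation (caller, callee, confidence):

```python
# Illustrative sketch (names and edge format are assumptions): keep only
# call-graph edges whose confidence meets the threshold used to derive
# a dataset variant.
def filter_edges(edges, threshold):
    """edges: iterable of (caller, callee, confidence) tuples."""
    return [(src, dst) for src, dst, conf in edges if conf >= threshold]

edges = [("a", "b", 0.95), ("a", "c", 0.10), ("b", "c", 0.25)]
print(filter_edges(edges, 0.20))  # keeps the 0.95 and 0.25 edges
```

At threshold 0.00 every edge survives, so the hybrid metrics approach their purely dynamic upper bound; at 0.30 only high-confidence edges remain.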
There are four variants for all of these datasets:
Both static (NII, NOI) and hybrid (HNII, HNOI) coupling metrics are included, with additional static source code metrics and information about the entries (file without any postfix). Columns contained only in this dataset are:
ID
Name
Longname
Parent ID
Component ID
Path
Line
Column
EndLine
EndColumn
Both static (NII, NOI) and hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics (file with '_h+s' postfix)
Only static (NII, NOI) coupling metrics are included with additional static source code metrics (file with '_s' postfix)
Only hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics (file with '_h' postfix)
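The postfix-to-variant convention above can be captured in a small lookup helper. The mapping and function names below are my own sketch, and the example file names are hypothetical instances of the threshold-prefix/postfix scheme:

```python
# Hypothetical mapping of file-name postfixes to the coupling metrics each
# dataset variant contains, as described above.
VARIANTS = {
    "": ["NII", "NOI", "HNII", "HNOI"],     # plus ID/Name/position columns
    "_h+s": ["NII", "NOI", "HNII", "HNOI"],
    "_s": ["NII", "NOI"],
    "_h": ["HNII", "HNOI"],
}

def variant_for(filename):
    """Return the coupling metrics expected in a dataset file,
    e.g. a file named like 'eslint_0.20_h.csv'."""
    stem = filename.rsplit(".", 1)[0]
    # Check the longest postfix first so '_h+s' is not mistaken for '_s'.
    for postfix in ("_h+s", "_s", "_h"):
        if stem.endswith(postfix):
            return VARIANTS[postfix]
    return VARIANTS[""]
```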
Static source code metrics contained in all datasets are the following:
McCC - McCabe Cyclomatic Complexity
NL - Nesting Level
NLE - Nesting Level Else If
CD - Comment Density
CLOC - Comment Lines of Code
DLOC - Documentation Lines of Code
TCD - Total Comment Density (comment lines in an embedded function are also considered)
TCLOC - Total Comment Lines of Code (comment lines in an embedded function are also considered)
LLOC - Logical Lines of Code (Comment and empty lines not counted)
LOC - Lines of Code (Comment and empty lines are counted)
NOS - Number of Statements
NUMPAR - Number of Parameters
TLLOC - Logical Lines of Code (Lines in embedded functions are also counted)
TLOC - Lines of Code (Lines in embedded functions are also counted)
TNOS - Total Number of Statements (Statements in embedded functions are also counted)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Reliable JavaScript. It features 7 columns including author, publication date, language, and book publisher.
The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes: * Six million methods overall * Two million of which have associated documentation (docstrings, JavaDoc, and more) * Metadata that indicates the original location (repository or line number, for example) where the data was found
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
DPO JavaScript Dataset
This repository contains a modified version of the JavaScript dataset originally sourced from axay/javascript-dataset-pn. The dataset has been adapted to fit the DPO (Direct Preference Optimization) format, making it compatible with the LLaMA-Factory project.
License
This dataset is licensed under the Apache 2.0 License.
Dataset Overview
The dataset consists of JavaScript code snippets that have been restructured and enhanced for use in… See the full description on the dataset page: https://huggingface.co/datasets/israellaguan/axay-javascript-dataset-pn.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset we used in our paper entitled "Towards a Prototype Based Explainable JavaScript Vulnerability Prediction Model". The manually validated dataset contains several static source code metrics along with vulnerability-fixing hashes for numerous vulnerabilities. For more details, you can read the paper here.
Security has become a central and unavoidable aspect of today’s software development. Practitioners and researchers have proposed many code analysis tools and techniques to mitigate security risks. These tools apply static and dynamic analysis or, more recently, machine learning. Machine learning models can achieve impressive results in finding and forecasting possible security issues in programs. However, there are at least two areas where most of the current approaches fall short of developer demands: explainability and granularity of predictions. In this paper, we propose a novel, simple, yet promising approach to identify potentially vulnerable source code in JavaScript programs. The model improves the state of the art in terms of explainability and prediction granularity, as it gives results at the level of individual source code lines, which is fine-grained enough for developers to take immediate action. Additionally, the model explains each predicted line (i.e., provides the most similar vulnerable line from the training set) using a prototype-based approach. In a study of 186 real-world and confirmed JavaScript vulnerability fixes from 91 projects, the approach could flag 60% of the known vulnerable lines on average by marking only 10% of the code-base, but in certain cases the model identified 100% of the vulnerable code lines while flagging only 8.72% of the code-base.
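The 60%-at-10% figure above is a recall-at-effort measurement: the fraction of known vulnerable lines covered when only a fixed share of the code-base is flagged. A minimal sketch of that computation (function name and line representation are my own, not the paper's code):

```python
def recall_at_flagged(flagged_lines, vulnerable_lines):
    """Fraction of known vulnerable lines covered by the flagged set.
    Lines are identified here as (file, line_number) pairs (an assumption)."""
    flagged = set(flagged_lines)
    vuln = set(vulnerable_lines)
    if not vuln:
        return 0.0
    return len(vuln & flagged) / len(vuln)

# Example: two lines flagged, one of the two known vulnerable lines is covered.
score = recall_at_flagged(
    [("f.js", 10), ("f.js", 11)],
    [("f.js", 10), ("g.js", 5)],
)
print(score)  # 0.5
```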
If you wish to use our dataset, please cite this dataset, or the corresponding paper:
@inproceedings{mosolygo2021towards,
title={Towards a Prototype Based Explainable JavaScript Vulnerability Prediction Model},
author={Mosolyg{\'o}, Bal{\'a}zs and V{\'a}ndor, Norbert and Antal, G{\'a}bor and Heged{\H{u}}s, P{\'e}ter and Ferenc, Rudolf},
booktitle={2021 International Conference on Code Quality (ICCQ)},
pages={15--25},
year={2021},
organization={IEEE}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains: 1) the object access logs, 2) script isolation policies and 3) script write conflicts collected by JSIsolate on Alexa top 1K websites. We analyze the access logs to generate the conflict summary files and script isolation policies that assign static scripts to an execution context.
We split the whole dataset of object access logs into 10 subsets, i.e., access-0.zip ~ access-9.zip.
The isolation policies are released in url-level-policies.zip and domain-level-policies.zip.
The object accesses (i.e., reads and writes) are saved in [rank].[main/sub].[frame_cnt].access (e.g., 1.main.0.access) files.
The URLs of frames (i.e., main frames and iframes) are saved in [rank].[main/sub].[frame_cnt].frame (e.g., 1.main.0.frame) files.
The maps from script IDs to script URLs are saved in [rank].[main/sub].[frame_cnt].id2url (e.g., 1.main.0.id2url) files.
The maps from script IDs to their parent script (i.e., the script that includes it) are saved in corresponding per-frame files.
The source code of scripts are saved in [rank].[main/sub].[frame_cnt].[script_ID].script (e.g., 1.main.0.17.script) files.
Note that we perform monkey testing during the data collection, which may cause the page to navigate to a different URL. Therefore, there could be multiple main frame files.
The conflicts are dumped to [rank].conflicts (e.g., 1.conflicts) files.
The isolation policies are dumped to [rank].configs (e.g., 1.configs) and [rank].configs-simple (e.g., 1.configs-simple) files.
Note that the *.configs files also include the read/write operations that cause JSIsolate to assign a script from third-party domain to the first-party context.
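The JSIsolate output mixes per-frame logs ([rank].[main/sub].[frame_cnt].*) with per-site files ([rank].conflicts, [rank].configs, [rank].configs-simple), so grouping a directory by site rank is a common first step. A minimal sketch of that grouping (helper names are my own):

```python
import re
from collections import defaultdict

# Per-frame logs, e.g. "1.main.0.access" or "1.main.0.17.script"
# (the optional middle number is the script ID for .script files).
PER_FRAME = re.compile(
    r"^(\d+)\.(?:main|sub)\.\d+(?:\.\d+)?\.(?:access|frame|id2url|script)$")
# Per-site files, e.g. "1.conflicts", "1.configs", "1.configs-simple".
PER_SITE = re.compile(r"^(\d+)\.(?:configs-simple|configs|conflicts)$")

def group_by_rank(filenames):
    """Group JSIsolate output files by Alexa site rank, ignoring other files."""
    groups = defaultdict(list)
    for name in filenames:
        m = PER_FRAME.match(name) or PER_SITE.match(name)
        if m:
            groups[int(m.group(1))].append(name)
    return dict(groups)
```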
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is The joy of JavaScript. It features 7 columns including author, publication date, language, and book publisher.
Content of this repository
This is the repository that contains the scripts and dataset for the MSR 2019 mining challenge. GitHub repository with the software used: here.
Dataset
The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing. This original file with no treatment is called jsanswers.csv; in it we can find the following information:
1. The ID of the question (PostId)
2. The content (in this case the code block)
3. The length of the code block
4. The line count of the code block
5. The score of the post
6. The title
A quick look at these files shows that a PostId can have multiple rows related to it; that is how multiple code blocks are saved in the database.
Filtered dataset: extracting code from CSV
We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks into their respective JavaScript file, with the PostId as the name; this resulted in 336 thousand files.
Running ESLint
Due to the single-threaded nature of ESLint, running it on 336 thousand files took a huge toll on the machine, so we created a script named "ESlintRunnerScript.py". It splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports, producing 20 JSON files.
Number of violations per rule
This information was extracted using the script named "parser.py", which generated the file "NumberofViolationsPerRule.csv" containing the number of violations per rule used in the linter configuration.
Number of violations per category
To produce relevant statistics for the dataset, we generated the number of violations per rule category as defined on the ESLint website; this information was extracted using the same "parser.py" script.
Individual reports
This information was extracted from the JSON reports; it is a CSV file with PostId and violations per rule.
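The chunk-and-run workaround for ESLint's single-threaded CLI can be sketched as follows. This is not the authors' ESlintRunnerScript.py; the chunking logic and the eslint invocation are my own reconstruction of the approach described above:

```python
import subprocess

def split_chunks(items, n):
    """Split items into n evenly distributed chunks (sizes differ by at most one)."""
    k, r = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = k + (1 if i < r else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def run_eslint(files, report_path):
    """One worker's job: lint its share of files and write a JSON report.
    (Invocation is illustrative; exact flags depend on the local setup.)"""
    subprocess.run(
        ["npx", "eslint", "--format", "json", "--output-file", report_path, *files],
        check=False,  # eslint exits nonzero when violations are found
    )
```

Launching 20 such workers, one per chunk, yields the 20 JSON report files mentioned above.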
Rules
The file "Rules with categories" contains all the rules used and their categories.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Eloquent JavaScript : a modern introduction to programming. It features 7 columns including author, publication date, language, and book publisher.
angie-chen55/javascript-github-code dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 5 rows and is filtered where the books is Beginning JavaScript and CSS development with jQuery. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Darien Schettler
Released under CC0: Public Domain
Dataset Card for dataset-JavaScript-general-coding
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/dataset-JavaScript-general-coding/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/dataset-JavaScript-general-coding.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental results of model comparison before and after abstract syntax tree recombination.
This dataset was created by Jordan Tantuico