axay/javascript-dataset-js dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by sSchumat
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains an anonymized list of surveyed developers who reported their expertise level with three popular JavaScript libraries:
ReactJS, a library for building rich web interfaces
MongoDB, a driver for accessing MongoDB databases
Socket.IO, a library for realtime communication
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
JavaScript dataset.
Dataset Card for dataset-JavaScript-general-coding
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI:
`distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/dataset-JavaScript-general-coding/raw/main/pipeline.yaml"`
or explore the configuration:
`distilabel pipeline info --config…` See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/dataset-JavaScript-general-coding.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset consists of multiple files which contain bug prediction training data.
The entries in the dataset are JavaScript functions, either buggy or non-buggy. Bug-related information was obtained from the ESLint project contained in BugsJS (https://github.com/BugsJS/eslint). The buggy instances were collected throughout the lifetime of the project; however, we added non-buggy entries from the latest version, which is tagged as a fix (entries previously included as buggy were not included as non-buggy later on).
The dataset is based on hybrid call graphs constructed by https://github.com/sed-szeged/hcg-js-framework. The result of this tool is a call graph whose edges are associated with a confidence level indicating how likely the given edge is a valid call edge.
We used different threshold values above which we considered the edges to be valid. The following threshold values were used:
0.00
0.05
0.20
0.30
The prefix in each dataset file name comes from the threshold used. The datasets include the coupling metrics NII (Number of Incoming Invocations) and NOI (Number of Outgoing Invocations), which were calculated by a static source code analyzer called SourceMeter. Hybrid counterparts of these metrics (HNII and HNOI) are based on the given threshold values.
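The edge-filtering step can be sketched as follows; the tuple representation of call edges is an assumption for illustration, not the actual hcg-js-framework output format.

```python
# Sketch: filtering hybrid call-graph edges by confidence threshold.
# The (caller, callee, confidence) tuples are illustrative; the real
# hcg-js-framework output format may differ.

def filter_edges(edges, threshold):
    """Keep only edges whose confidence is at least the threshold."""
    return [(src, dst) for (src, dst, conf) in edges if conf >= threshold]

edges = [
    ("main", "parse", 0.95),
    ("main", "maybeCall", 0.10),
    ("parse", "tokenize", 0.25),
]

# With threshold 0.00 every edge is kept; with 0.20 the low-confidence
# edge is dropped, which changes any HNII/HNOI values derived from the graph.
print(len(filter_edges(edges, 0.00)))  # 3
print(len(filter_edges(edges, 0.20)))  # 2
```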
There are four variants for all of these datasets:
Both static (NII, NOI) and hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics and information about the entries (file without any postfix). Columns contained only in this dataset are:
ID
Name
Longname
Parent ID
Component ID
Path
Line
Column
EndLine
EndColumn
Both static (NII, NOI) and hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics (file with '_h+s' postfix)
Only static (NII, NOI) coupling metrics are included with additional static source code metrics (file with '_s' postfix)
Only hybrid (HNII, HNOI) coupling metrics are included with additional static source code metrics (file with '_h' postfix)
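The four thresholds and four variants combine into sixteen dataset files; a small sketch of the enumeration, where the exact naming pattern ("<threshold><postfix>.csv") is an assumption for illustration:

```python
# Sketch: enumerating the file names implied by the threshold prefixes
# and variant postfixes described above. The exact naming pattern
# ("<threshold><postfix>.csv") is an assumption for illustration.
thresholds = ["0.00", "0.05", "0.20", "0.30"]
postfixes = ["", "_h+s", "_s", "_h"]  # full / hybrid+static / static-only / hybrid-only

files = [f"{t}{p}.csv" for t in thresholds for p in postfixes]
print(len(files))  # 4 thresholds x 4 variants = 16 files
```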
Static source code metrics contained in all datasets are the following:
McCC - McCabe Cyclomatic Complexity
NL - Nesting Level
NLE - Nesting Level Else If
CD - Comment Density
CLOC - Comment Lines of Code
DLOC - Documentation Lines of Code
TCD - Total Comment Density (comment lines in embedded functions are also considered)
TCLOC - Total Comment Lines of Code (comment lines in embedded functions are also considered)
LLOC - Logical Lines of Code (Comment and empty lines not counted)
LOC - Lines of Code (Comment and empty lines are counted)
NOS - Number of Statements
NUMPAR - Number of Parameters
TLLOC - Logical Lines of Code (Lines in embedded functions are also counted)
TLOC - Lines of Code (Lines in embedded functions are also counted)
TNOS - Total Number of Statements (Statements in embedded functions are also counted)
angie-chen55/javascript-github-code dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
Our chatbot will be trained on a specialized Q&A dataset about the React JavaScript library. This React Q&A dataset is provided as a JSON file containing roughly 26,300 question-answer pairs (the exact number may vary slightly). Each entry in the JSON list has a "question" field and a corresponding "answer" field, e.g.:
{"question": "What is React?", "answer": "React is an open-source JavaScript library for building user interfaces..."}
This format (a list of objects with ‘question’ and ‘answer’ strings) is common in QA collections. For comparison, a well-known QA dataset like SQuAD (Stanford Question Answering Dataset) contains on the order of 100,000 question-answer pairs. Our React dataset is smaller but still substantial. It covers many topics relevant to React: definitions (e.g. “What is JSX?”), how-to guides (e.g. “How to install react-datepicker?”), component usage, common patterns, troubleshooting, and performance features.
| Aspect | Details |
|---|---|
| Dataset Size | ~26,300 question-answer pairs |
| Format | JSON list; each entry has question and answer fields |
| Domain | React.js (theoretical and practical Q&A) |
| Examples | What is React?, How to install react-datepicker?, etc. |
Because this dataset is domain-specific (about React), it serves as a tailored knowledge base for the chatbot. Using a focused corpus like this is recommended: as noted by experts, “if your QA system focuses on a particular domain (e.g., technical), consider domain-specific corpora” and even curate your own Q&A pairs. This helps the model learn React terminology and concepts deeply. The dataset’s JSON structure (a flat list of QA entries) is simple and ready for loading into typical training pipelines.
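A minimal sketch of loading and validating this flat question/answer layout; the two sample entries are illustrative, not drawn from the actual dataset:

```python
import json

# Sketch: validating the flat question/answer JSON layout. The two
# sample entries are illustrative, not taken from the actual dataset.
sample = json.loads("""
[
  {"question": "What is React?",
   "answer": "React is an open-source JavaScript library for building user interfaces..."},
  {"question": "What is JSX?",
   "answer": "JSX is a syntax extension that lets you write HTML-like markup in JavaScript..."}
]
""")

# Every entry must expose exactly the two string fields the format promises.
for entry in sample:
    assert set(entry) == {"question", "answer"}
    assert isinstance(entry["question"], str) and isinstance(entry["answer"], str)
print(len(sample), "QA pairs validated")
```

The same loop scales to the full ~26,300-entry file, making it a cheap sanity check before feeding the data into a training pipeline.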
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered where the book is Eloquent JavaScript : a modern introduction to programming. It features 7 columns including author, publication date, language, and book publisher.
This dataset was created by Jordan Tantuico
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for "code-search-net-javascript"
Dataset Summary
This dataset is the JavaScript portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions with comments found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are written in JavaScript.
Data Splits
Train, test, validation labels are… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-javascript.
MIT License: https://opensource.org/licenses/MIT
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-text/javascript in Semeru
CodeXGLUE -- Code-To-Text
Task Definition
The task is to generate natural language comments for code, evaluated by smoothed BLEU-4 score.
Dataset
The dataset we use comes from CodeSearchNet and we filter the dataset as the following:… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-text-javascript.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered where the book is Reliable JavaScript. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
External JavaScript imports extracted from CommonCrawl CC-MAIN-2024-10 and CC-MAIN-2024-18 in CSV format.
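A minimal sketch of reading such a CSV; the column names page_url and script_url are assumptions, since the actual header is not documented above:

```python
import csv
import io

# Sketch: reading the external-import CSV. The column names (page_url,
# script_url) are assumptions; the actual header is not shown above.
sample_csv = """page_url,script_url
https://example.com/,https://cdn.example.org/app.js
https://example.com/blog,https://unpkg.com/react@18/umd/react.production.min.js
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
# Distinct hosts serving external scripts, a typical first aggregation.
hosts = {row["script_url"].split("/")[2] for row in rows}
print(sorted(hosts))  # ['cdn.example.org', 'unpkg.com']
```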
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about book subjects. It has 2 rows and is filtered where the book is Learning JavaScript. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Overview
The Vulnerability Fix Dataset is a collection of 35,000 code snippets containing both vulnerable and fixed versions of code. The dataset focuses on common software security vulnerabilities and their corresponding fixes, making it highly valuable for research in secure coding practices, automated vulnerability detection, and software security analysis.
Dataset Structure
This dataset consists of three main columns:
vulnerability_type: The type of security vulnerability (e.g., SQL Injection, Cross-Site Scripting).
vulnerable_code: The original code snippet containing the vulnerability.
fixed_code: The secure version of the code with the vulnerability fixed.
The dataset includes vulnerabilities across multiple programming languages, making it useful for machine learning, static analysis, and cybersecurity training.
Features of the Dataset The Vulnerability Fix Dataset contains the following key features:
vulnerability_type (String)
The category of the security vulnerability present in the code. Examples: SQL Injection, Cross-Site Scripting (XSS), Buffer Overflow, Command Injection, Insecure Cryptographic Practices.
vulnerable_code (String)
The original code snippet that contains a security vulnerability. Written in various programming languages, including Java, Python, C, and JavaScript. Used for analyzing insecure coding patterns.
fixed_code (String)
The corrected version of the vulnerable_code with security improvements. Demonstrates best practices in secure coding. Helps in training AI models for automatic vulnerability fixing.
This dataset is structured to support research in automated vulnerability detection, static code analysis, and secure software development.
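The three-column layout can be sketched as follows; the snippets are illustrative examples, not actual rows from the dataset:

```python
# Sketch: the three-column record layout described above. The snippets
# are illustrative examples, not actual rows from the dataset.
records = [
    {
        "vulnerability_type": "SQL Injection",
        "vulnerable_code": 'query = "SELECT * FROM users WHERE id = " + user_id',
        "fixed_code": 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',
    },
    {
        "vulnerability_type": "Cross-Site Scripting (XSS)",
        "vulnerable_code": "element.innerHTML = userInput",
        "fixed_code": "element.textContent = userInput",
    },
]

# Group records by vulnerability type, a typical first step before
# training a detector or fix model on the vulnerable/fixed pairs.
by_type = {}
for rec in records:
    by_type.setdefault(rec["vulnerability_type"], []).append(rec)
print(sorted(by_type))
```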
The dataset contains obfuscated and non-obfuscated files in clearly divided directories.
The files are taken from the dataset provided in the book Machine Learning for Cybersecurity Cookbook.
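A minimal sketch of labeling files by their parent directory, matching the divided layout described above; the directory names "obfuscated" and "non_obfuscated" are assumptions for illustration:

```python
import os
import tempfile

# Sketch: labeling files by parent directory, matching the "clearly
# divided directories" layout described above. The directory names
# ("obfuscated" / "non_obfuscated") are assumptions for illustration.
root = tempfile.mkdtemp()
for d in ("obfuscated", "non_obfuscated"):
    os.makedirs(os.path.join(root, d))
    with open(os.path.join(root, d, "sample.js"), "w") as f:
        f.write("// placeholder\n")

# Walk the tree and label each file: True = obfuscated, False = not.
labeled = [
    (fname, os.path.basename(dirpath) == "obfuscated")
    for dirpath, _, fnames in os.walk(root)
    for fname in fnames
]
print(sorted(labeled))
```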
This dataset contains the predicted prices of the asset JavaScript over the next 16 years. The data is calculated initially using a default 5 percent annual growth rate; after page load, a sliding scale component lets the user adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
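The projection logic described above can be sketched as compound growth with the rate clamped to the stated range:

```python
# Sketch: a default 5 percent annual growth rate compounded over 16
# years, with the rate clamped to the stated range of -100% to +100%.

def project_prices(start_price, annual_rate, years=16):
    """Return the projected price at the end of each of the next `years` years."""
    rate = max(-1.0, min(1.0, annual_rate))  # clamp to the allowed range
    return [start_price * (1 + rate) ** year for year in range(1, years + 1)]

prices = project_prices(100.0, 0.05)
print(round(prices[0], 2))   # year 1: 105.0
print(round(prices[-1], 2))  # year 16: about 218.29
```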
This dataset contains daily snapshots of offers scraped from JustJoinIT, one of the biggest IT job boards in Poland. The dataset covers offers for a variety of programming languages and areas (Java, C#, Python, JavaScript, data engineering, and more).
Job offers were fetched from an API endpoint that exposed all job offers. I created a simple AWS lambda function that was invoked once per day and persisted extracted data on S3. Data is raw - the original JSON served by the API was saved on S3 and there was no processing in between.
First captured day: 23rd of October, 2021. Last captured day: 25th of September, 2023.
The dataset is incomplete (due to a lack of retries in the data-fetching script). Missing days:
2022-06-05
2022-09-12
2022-10-03
2022-10-10
2022-10-14
2022-10-17
2022-10-22
2022-10-23
2022-10-25
2022-10-29
2022-11-06
2022-11-12
2022-11-13
2022-12-11
2022-12-18
2022-12-26
2023-02-04
2023-02-07
2023-02-08
2023-02-26
2023-03-11
2023-03-12
2023-03-27
2023-04-03
2023-04-12
2023-04-14
2023-04-17
2023-04-19
2023-04-20
2023-04-21
2023-04-22
2023-04-24
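Given the capture range and the 32 missing days listed above, the effective coverage can be computed as:

```python
from datetime import date

# Coverage check: first/last capture dates from the text above, minus
# the 32 missing days listed.
first, last = date(2021, 10, 23), date(2023, 9, 25)
total_days = (last - first).days + 1
missing = 32  # number of dates in the list above
print(total_days, total_days - missing)  # 703 days in range, 671 captured
```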
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The Iris Dataset consists of 150 iris samples, each having four numerical features: sepal length, sepal width, petal length, and petal width. Each sample is categorized into one of three iris species: Setosa, Versicolor, or Virginica. This dataset is widely used as a sample dataset in machine learning and statistics due to its simple and easily understandable structure.
Feature Information: Sepal Length (cm), Sepal Width (cm), Petal Length (cm), Petal Width (cm)
Target Information: Iris Species: 1. Setosa, 2. Versicolor, 3. Virginica
Source: The Iris Dataset is obtained from the scikit-learn (sklearn) library under the BSD (Berkeley Software Distribution) license.
File Formats:
The Iris Dataset is one of the most iconic datasets in the world of machine learning and data science. This dataset contains information about three species of iris flowers: Setosa, Versicolor, and Virginica. With features like sepal and petal length and width, the Iris dataset has been a stepping stone for many beginners in understanding the fundamental concepts of classification and data analysis. With its clarity and diversity of features, the Iris dataset is perfect for exploring various machine learning techniques and building accurate classification models. I present the Iris dataset from scikit-learn with the hope of providing an enjoyable and inspiring learning experience for the Kaggle community!
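Since the source notes the dataset comes from scikit-learn, loading it there reproduces the structure described above:

```python
from sklearn.datasets import load_iris

# Loading the dataset from scikit-learn, as the source describes,
# reproduces the 150 x 4 feature matrix and the three species labels.
iris = load_iris()
print(iris.data.shape)           # (150, 4)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
```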