Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration authorities receive over 200,000 petitions, each filed by a U.S. employer on behalf of a worker, and select 85,000 applications through a random lottery. It is the most common visa status for international students who begin full-time work after completing college or higher education. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers in H1B visa applications. Because location, employer, job title, and salary range characterize most H1B petitions, several visualization tools are used to analyze and interpret H1B visa trends and to provide recommendations to applicants. This report is the basis of a project for the Visualization of Complex Data class at the George Washington University; it analyzes the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) to examine how H1B visa petitions have changed in recent years.
Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

Dataset
The dataset contains 10 columns and a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

Running the code
Open Index.html.

Data Processing
- Preprocess the raw data to transform it into an analyzable format.
- Find and combine external datasets, such as the FY2017 data, to enrich the analysis.
- Develop the relevant variables and compile them into the visualization programs.
- Draw a geo map and scatter plot to compare the fastest growth in absolute numbers and in percentages.
- Analyze changes in employers' preferences and forecast future trends.

Visualizations
- Combo chart: overall volume of receipts and approval rate.
- Scatter plot: beneficiaries' country of birth.
- Geo map: H1B petitions filed across all states.
- Line chart: top 10 states by H1B petitions filed.
- Pie chart: comparison of education level and occupations for petitions, FY2011 vs. FY2017.
- Tree map: top employers by number of applications submitted.
- Side-by-side bar chart: overall comparison of Data Scientist and Data Analyst positions.
- Highlight table: mean wage of Data Scientists and Data Analysts with certified case status.
- Bubble chart: top 10 companies for Data Scientist and Data Analyst positions.

Related Research
- The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
- Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
- Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
- H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
- H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa

Key Findings
- The analysis shows the government cut the number of H1B approvals in 2017.
- Over the past decade, driven by demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from Asian countries such as China and India.
- Technical jobs such as Computer Systems Analyst and Software Developer make up the majority of the top 10 jobs among foreign workers.
- Employers located in metro areas compete to find foreign workers who can fill their technical positions.
- California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are the prime locations for foreign workers and offer many job opportunities.
- The companies that submit the most H1B visa applications, such as Infosys, Tata, and IBM India, are India-based software and IT services firms.
- Data Scientist positions have grown exponentially in H1B visa applications, with jobs clustered most heavily in the West region.

Visualization programs
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
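As a sketch of the preprocessing and top-10-states steps described above, the worksite column can be split into states and counted with pandas. The column names below follow the dataset description (the exact casing in the Kaggle CSV may differ), and a tiny inline DataFrame stands in for the real 3-million-row file:

```python
import pandas as pd

# Tiny stand-in for the Kaggle H1B file; the real CSV has these columns
# (names assumed from the dataset description above).
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "CERTIFIED", "DENIED", "CERTIFIED"],
    "WORKSITE": ["SAN FRANCISCO, CALIFORNIA", "NEW YORK, NEW YORK",
                 "AUSTIN, TEXAS", "SAN JOSE, CALIFORNIA"],
})

# Keep certified petitions and extract the state from "CITY, STATE" worksites.
certified = df[df["CASE_STATUS"] == "CERTIFIED"].copy()
certified["STATE"] = certified["WORKSITE"].str.split(",").str[-1].str.strip()

# Petition counts per state, descending; take the head for a top-N chart.
top_states = certified["STATE"].value_counts()
print(top_states.head(10))
```

On the full dataset, `top_states.head(10)` would feed directly into the line chart of top 10 states listed above.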
The datatablesview extension for CKAN enhances the display of tabular datasets within CKAN by integrating the DataTables JavaScript library. As a fork of a previous DataTables CKAN plugin, this extension aims to provide improved functionality and maintainability for presenting data in a user-friendly and interactive tabular format. This tool focuses on making data more accessible and easier to explore directly within the CKAN interface.

Key Features:
- Enhanced Data Visualization: Transforms standard CKAN dataset views into interactive tables using the DataTables library, providing a more engaging user experience than plain HTML tables.
- Interactive Table Functionality: Includes sorting, filtering, and pagination within the data table, allowing users to easily navigate and analyze large datasets directly in the browser.
- Improved Data Accessibility: Makes tabular data more accessible to a wider range of users by providing intuitive tools to explore and understand the information.
- Presumed Customizable Appearance: Because it is based on DataTables, users can likely customize the look and feel of the tables through DataTables configuration options (note: this is an assumption based on standard DataTables usage and may require coding).

Use Cases (based on typical DataTables applications):
- Government Data Portals: Display complex government datasets in a format that is easy for citizens to search, filter, and understand, enhancing transparency and promoting data-driven decision-making. For example, presenting financial data, population statistics, or environmental monitoring results.
- Research Data Repositories: Allow researchers to quickly explore and analyze large scientific datasets directly within the CKAN interface, facilitating data discovery and collaboration.
- Corporate Data Catalogs: Enable business users to easily access and manipulate tabular data relevant to their roles, improving data literacy and enabling data-informed business strategies.

Technical Integration (inferred from CKAN extension structure): The extension likely leverages CKAN's plugin architecture to override the default dataset view for tabular data, using CKAN's templating system to render datasets with DataTables' JavaScript and CSS.

Benefits & Impact: By implementing the datatablesview extension, organizations can improve the user experience when accessing and exploring tabular datasets within their CKAN instances. The enhanced interactivity and data exploration features can lead to increased data utilization, improved data literacy, and more effective data-driven decision-making within organizations and communities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
End-to-End (E2E) testing is a comprehensive approach to validating the functionality of a software application by testing its entire workflow from the user’s perspective, ensuring that all integrated components work together as expected. It is crucial for ensuring the quality and reliability of applications, especially in the web domain, which is often bound by Service Level Agreements (SLAs). This testing involves two key activities:
Graphical User Interface (GUI) testing, which simulates user interactions through browsers, and performance testing, which evaluates system workload handling. Despite its importance, E2E testing is often neglected, and the lack of reliable datasets for Web GUI and performance testing has slowed research progress. This paper addresses these limitations by constructing E2EGit, a comprehensive dataset, cataloging non-trivial open-source web projects on GITHUB that adopt GUI or performance testing.
The dataset construction process involved analyzing over 5k non-trivial web repositories based on popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, PYTHON) to identify: 1) GUI tests based on popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), 2) performance tests written with the most popular open-source tools (JMETER, LOCUST). After analysis, we identified 472 repositories using web GUI testing, with over 43,000 tests, and 84 repositories using performance testing, with 410 tests.
DATASET DESCRIPTION
The dataset is provided as an SQLite database consisting of five tables, each serving a specific purpose; its structure is illustrated in Figure 3 of the paper.
The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics. The
non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks, with the corresponding fields (e.g., is_web_java) set to true and the web_dependencies field listing the detected web frameworks. For web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file, providing the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, which aggregates data from the previous table at the repository level. Each of the 472 repositories has a row summarizing the number of test files using frameworks like SELENIUM or PLAYWRIGHT, the test engines used, like JUNIT, and the total number of tests identified. For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Notably, some fields may be absent, for instance when external files (e.g., CSVs defining workloads) were unavailable, or in the case of LOCUST tests, where parameters like duration and concurrent users are specified via the command line.
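As a sketch of how such an SQLite database could be queried with Python's built-in sqlite3 module, the snippet below creates a toy gui_testing_repo_details table in memory (the column names other than the table name are illustrative assumptions; substitute the real database file for ":memory:"):

```python
import sqlite3

# ":memory:" stands in for the real E2EGit database file; the per-framework
# column names below are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE gui_testing_repo_details (
        repository TEXT,          -- repository full name
        selenium_files INTEGER,   -- test files using SELENIUM
        playwright_files INTEGER, -- test files using PLAYWRIGHT
        total_tests INTEGER       -- total GUI tests identified
    )
""")
conn.executemany(
    "INSERT INTO gui_testing_repo_details VALUES (?, ?, ?, ?)",
    [("org/app-a", 12, 0, 340), ("org/app-b", 0, 5, 88)],
)

# Rank repositories by the number of GUI tests they contain.
rows = conn.execute(
    "SELECT repository, total_tests FROM gui_testing_repo_details "
    "ORDER BY total_tests DESC"
).fetchall()
print(rows)
```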
To cite this article, use the following BibTeX entry:
@inproceedings{di2025e2egit,
title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
pages={10--15},
year={2025},
organization={IEEE/ACM}
}
This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.
The ckanext-plotly extension provides CKAN users with the ability to create and view charts using plotly.js and the Plotly react-chart-editor. It enables integration of interactive data visualization directly within the CKAN platform, making data more accessible and understandable. The extension leverages the capabilities of Plotly's JavaScript library to generate rich and customizable charts from CKAN datasets. Key Features: Plotly.js Integration: Uses plotly.js, an open-source JavaScript charting library, to render a wide variety of interactive plots directly within CKAN.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview
The Vulnerability Fix Dataset is a collection of 35,000 code snippets containing both vulnerable and fixed versions of code. The dataset focuses on common software security vulnerabilities and their corresponding fixes, making it highly valuable for research in secure coding practices, automated vulnerability detection, and software security analysis.

Dataset Structure
This dataset consists of three main columns:
- vulnerability_type: The type of security vulnerability (e.g., SQL Injection, Cross-Site Scripting).
- vulnerable_code: The original code snippet containing the vulnerability.
- fixed_code: The secure version of the code with the vulnerability fixed.
The dataset includes vulnerabilities across multiple programming languages, making it useful for machine learning, static analysis, and cybersecurity training.
Features of the Dataset The Vulnerability Fix Dataset contains the following key features:
vulnerability_type (String)
The category of the security vulnerability present in the code. Examples: SQL Injection, Cross-Site Scripting (XSS), Buffer Overflow, Command Injection, Insecure Cryptographic Practices.

vulnerable_code (String)
The original code snippet that contains a security vulnerability. Written in various programming languages, including Java, Python, C, and JavaScript. Used for analyzing insecure coding patterns.

fixed_code (String)
The corrected version of the vulnerable_code with security improvements. Demonstrates best practices in secure coding. Helps in training AI models for automatic vulnerability fixing.

This dataset is structured to support research in automated vulnerability detection, static code analysis, and secure software development.
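A minimal sketch of working with the three columns in pandas. The two rows below are invented for illustration, and how you load the real 35,000-row dataset depends on its distribution format (e.g., pd.read_csv on a CSV export):

```python
import pandas as pd

# Two illustrative vulnerable/fixed pairs mirroring the dataset's schema;
# the code snippets here are invented examples, not rows from the dataset.
df = pd.DataFrame({
    "vulnerability_type": ["SQL Injection", "Cross-Site Scripting (XSS)"],
    "vulnerable_code": [
        'query = "SELECT * FROM users WHERE name = \'" + name + "\'"',
        "element.innerHTML = userInput;",
    ],
    "fixed_code": [
        'cursor.execute("SELECT * FROM users WHERE name = ?", (name,))',
        "element.textContent = userInput;",
    ],
})

# A typical first step for ML or analysis: count examples per vulnerability class.
counts = df["vulnerability_type"].value_counts()
print(counts)
```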
StRuCom
Task description
The dataset contains structured Russian-language docstrings for functions in 5 programming languages (Python, Java, C#, Go, JavaScript) and comprises 500 tasks. Key features:
- First specialized corpus for Russian-language documentation
- Combination of real GitHub data (for testing) and synthetic data from Qwen2.5-Coder-32B-Instruct (for training)
- Strict filtering for completeness and compliance with documentation standards
- All comments conform…
See the full description on the dataset page: https://huggingface.co/datasets/MERA-evaluation/StRuCom.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the master's thesis "LLMs for Code Comment Consistency." Covers the languages Go, Java, JavaScript, TypeScript, and Python. All data is mined from permissively-licensed GitHub public projects with at least 25 stars and 25 pull requests submitted at the time of access, based on the GitHub Public Repository Metadata Dataset.
This dataset pertains specifically to pull request comments that are made on files. In other words, every comment in this dataset is linked to a specific file in a pull request.
Anything you want, of course, but here are some starter ideas:
- Sentiment analysis of comments: is there a correlation between number of contributions and positivity of reviews?
- Pull request comment generation: can we automatically make code review comments?
- PR text mining: can we mine out examples of a specific type of comment? (In my project, this was comments about function documentation.)
The mining code is publicly accessible at https://github.com/pelmers/llms-for-code-comment-consistency/tree/main/rq3
Each file is a JSON object where each key is a GitHub repository and each value is a list of pull request comments in that repository.
Example:
{
  "trekhleb/javascript-algorithms": [{
    "html_url": "https://github.com/trekhleb/javascript-algorithms/pull/101#discussion_r204437121",
    "path": "src/algorithms/string/knuth-morris-pratt/knuthMorrisPratt.js",
    "line": 33,
    "body": "Please take a look at the comments to the tests above. No need to do this checking.",
    "user": "trekhleb",
    "diff_hunk": "@@ -30,6 +30,10 @@ function buildPatternTable(word) {\n * @return {number}\n */\nexport default function knuthMorrisPratt(text, word) {\n+  if (word.length === 0) {",
    "author_association": "OWNER",
    "commit_id": "618d0962025ff1116979560a0bfa0ed1660f129e",
    "id": 204437121,
    "repo": "trekhleb/javascript-algorithms"
  }, ...]
}
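A minimal sketch of iterating over one such file with Python's json module. The JSON is inlined here (abridged to a few fields) so the snippet is self-contained; in practice you would json.load an opened data file:

```python
import json

# Abridged version of an entry in the described format, inlined for
# illustration; a real file would be read with json.load(open(path)).
raw = """
{
  "trekhleb/javascript-algorithms": [{
    "path": "src/algorithms/string/knuth-morris-pratt/knuthMorrisPratt.js",
    "line": 33,
    "body": "Please take a look at the comments to the tests above.",
    "user": "trekhleb",
    "id": 204437121
  }]
}
"""
data = json.loads(raw)

# Walk the mapping: repository name -> list of review comments.
for repo, comments in data.items():
    for comment in comments:
        print(repo, comment["path"], comment["line"])
```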
The ckanext-papaya extension enhances CKAN by providing specialized viewers for medical imaging data. Specifically, it enables the display of NIFTI (.nii) and DICOM (.dcm) files directly within the CKAN interface using the Papaya JavaScript viewer. The extension supports both single DICOM files and DICOM archives uploaded as ZIP files, facilitating easy access to and visualization of medical imaging datasets.

Key Features:
- NIFTI and DICOM Viewing: Renders NIFTI (.nii) and DICOM (.dcm) files directly in CKAN using the Papaya viewer.
- DICOM ZIP Archive Support: Allows users to upload ZIP archives containing DICOM files, which are then extracted and displayed using Papaya. Only files with the .dcm extension within the ZIP are read; other file types are ignored.
- Automatic View Creation: Automatically creates Papaya views for newly uploaded NIFTI files, single DICOM files, and DICOM ZIP archives. Note that existing resources may need the view to be added manually.
- Client-Side Rendering: Leverages the Papaya JavaScript framework for client-side rendering of medical images, eliminating the need for a separate server and providing a streamlined visualization experience directly in the user's browser.
- Temporary Unzipping Mechanism: The extension unzips DICOM archives temporarily to extract DICOM files for viewing, then immediately deletes the extracted files to conserve server space and maintain security.

Technical Integration: The ckanext-papaya extension integrates with CKAN by adding a new view type that utilizes the Papaya JavaScript library. To enable it, the papaya plugin must be added to the ckan.plugins setting in CKAN's configuration file. To avoid enabling the Papaya viewer for all ZIP files, configure ckan.views.default_views accordingly. No other configuration settings are currently needed.

Benefits & Impact: By incorporating ckanext-papaya, CKAN instances that host medical imaging datasets can offer a streamlined, in-browser viewing experience for NIFTI and DICOM files. This eliminates the need for users to download and use external viewers, simplifying data exploration and improving accessibility to medical imaging data shared through CKAN. It lowers the barrier to entry for users exploring medical imaging data, without requiring prior knowledge of the underlying datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset for a university chatbot typically consists of a collection of queries and their corresponding responses. These queries are typically questions or requests made by users interacting with the chatbot, and the responses are the answers or actions provided by the chatbot in return. The dataset is usually organized in JSON format for easy storage and retrieval. Here's a brief description of what you might find in such a dataset:
JSON Format: The dataset is structured using the JSON (JavaScript Object Notation) format, which is a lightweight data interchange format. JSON consists of key-value pairs and nested structures, making it easy to represent structured data like queries and responses.
Queries: Each entry in the dataset includes a "query" field. This field contains the user's input or question to the chatbot. These queries can cover a wide range of topics related to university life, such as admissions, course information, campus facilities, events, or general inquiries.
Responses: The "response" field contains the chatbot's reply or action in response to the user's query. This could be a straightforward textual response, a set of actions to be performed, or even links to relevant resources. The responses are designed to assist and provide information to the user.
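A hypothetical entry illustrating the query/response structure described above, parsed with Python's json module (the field values are invented; only the "query" and "response" keys come from the description):

```python
import json

# One invented entry in the described query/response format.
entry = json.loads("""
{
  "query": "What are the admission requirements for the MSc program?",
  "response": "Applicants need a bachelor's degree, transcripts, and two references."
}
""")

print(entry["query"])
print(entry["response"])
```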
The dataset was created manually and with the help of tech support.
For the ML code: https://github.com/TusharPaul01/ChatBot-ML-LSTM-