Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
End-to-end (E2E) testing is a software validation approach that simulates realistic user scenarios throughout the entire workflow of an application. In the context of web
applications, E2E testing involves two activities: Graphical User Interface (GUI) testing, which simulates user interactions with the web app’s GUI through web browsers, and performance testing, which evaluates how the system handles workload. Despite its recognized importance in delivering high-quality web applications, the availability of large-scale datasets featuring real-world E2E web tests remains limited, hindering research in the field.
To address this gap, we present E2EGit, a comprehensive dataset of non-trivial open-source web projects collected on GitHub that adopt E2E testing. By analyzing over 5,000 web repositories across popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, and PYTHON), we identified 472 repositories implementing 43,670 automated Web GUI tests with popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), and 84 repositories that featured 271 automated performance tests implemented leveraging the most popular open-source tools (JMETER, LOCUST). Among these, 13 repositories implemented both types of testing for a total of 786 Web GUI tests and 61 performance tests.
DATASET DESCRIPTION
The dataset is provided as an SQLite database whose structure, illustrated in Figure 3 of the paper, consists of five tables, each serving a specific purpose.
The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics. The non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks; the corresponding fields (e.g., is_web_java) are set to true, and the web_dependencies field lists the detected web frameworks.
For Web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file and provides the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, which aggregates data from the previous table at the repository level. Each of the 472 repositories has a row summarizing the number of test files using frameworks like SELENIUM or PLAYWRIGHT, test engines like JUNIT, and the total number of tests identified.
For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Some fields may be absent, for instance when external files (e.g., CSVs defining workloads) were unavailable, or in the case of LOCUST tests, where parameters like duration and concurrent users are specified via the command line.
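As a minimal sketch of how the database can be explored (the file name e2egit.db is a placeholder; only the table names are taken from the description above, while column names are not assumed), the following Python snippet counts the rows in each of the five tables and lists the schema of one of them:

```python
import sqlite3

# Placeholder path; point this at the released E2EGit SQLite database.
conn = sqlite3.connect("e2egit.db")

tables = [
    "repository",
    "non_trivial_repository",
    "gui_testing_test_details",
    "gui_testing_repo_details",
    "performance_testing_test_details",
]

# Row counts give a first overview of the dataset's size.
for table in tables:
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(f"{table}: {count} rows")

# Inspect the columns of one table without assuming their names.
for cid, name, ctype, *_ in conn.execute("PRAGMA table_info(gui_testing_test_details)"):
    print(name, ctype)

conn.close()
```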
To cite this dataset, use the following BibTeX entry:
@inproceedings{di2025e2egit,
title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
pages={10--15},
year={2025},
organization={IEEE/ACM}
}
This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.
Collected COVID-19 datasets from various sources as part of DAAN-888 course, Penn State, Spring 2022. Collaborators: Mohamed Abdelgayed, Heather Beckwith, Mayank Sharma, Suradech Kongkiatpaiboon, and Alex Stroud
**1 - COVID-19 Data in the United States**
Source: The data is collected from multiple public health official sources by NY Times journalists and compiled in one single file.
Description: Daily count of new COVID-19 cases and deaths for each state. Data is updated daily and runs from 1/21/2020 to 2/4/2022.
URL: https://github.com/nytimes/covid-19-data/blob/master/us-states.csv
Data size: 38,814 rows and 5 columns.
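As a minimal sketch (assuming the pandas library and that the file is still hosted in the repository above), the CSV can be loaded by switching the blob URL to its raw counterpart:

```python
import pandas as pd

# Raw counterpart of the blob URL listed above.
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"

df = pd.read_csv(url)

# The listing reports 38,814 rows and 5 columns; the repository has continued to be
# updated, so a newer snapshot may contain more rows.
print(df.shape)
print(df.head())
```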
**2 - Mask-Wearing Survey Data**
Source: The New York Times is releasing estimates of mask usage by county in the United States.
Description: This data comes from a large number of interviews conducted online by the global data and survey firm Dynata, at the request of The New York Times. The firm asked a question about mask usage to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level.
URL: https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv
Data size: 3,142 rows and 6 columns
**3a - Vaccine Data – Global**
Source: This data comes from the US Centers for Disease Control and Prevention (CDC), Our World in Data (OWiD) and the World Health Organization (WHO).
Description: Time series data of vaccine doses administered and the number of fully and partially vaccinated people by country. This data was last updated on February 3, 2022
URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/global_data/time_series_covid19_vaccine_global.csv
Data Size: 162,521 rows and 8 columns
**3b - Vaccine Data – United States**
Source: The data is compiled from individual states' public dashboards and data from the US Centers for Disease Control and Prevention (CDC).
Description: Time series data of total vaccine doses shipped and administered, by manufacturer, dose number (first or second), and state. This data was last updated on February 3, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/vaccine_data/us_data/time_series/vaccine_data_us_timeline.csv
Data Size: 141,503 rows and 13 columns
**4 - Testing Data**
Source: The data is compiled from individual states' public dashboards and data from the U.S. Department of Health & Human Services.
Description: Time series data of total tests administered by county and state. This data was last updated on January 25, 2022.
URL: https://github.com/govex/COVID-19/blob/master/data_tables/testing_data/county_time_series_covid19_US.csv
Data size: 322,154 rows and 8 columns
**5 – US State and Territorial Public Mask Mandates**
Source: Data from state and territory executive orders, administrative orders, resolutions, and proclamations is gathered from government websites and cataloged and coded by one coder using Microsoft Excel, with quality checking provided by one or more other coders.
Description: US State and Territorial Public Mask Mandates from April 10, 2020 through August 15, 2021 by County by Day
URL: https://data.cdc.gov/Policy-Surveillance/U-S-State-and-Territorial-Public-Mask-Mandates-Fro/62d6-pm5i
Data Size: 1,593,869 rows and 10 columns
**6 – Case Counts & Transmission Level**
Source: This open-source dataset contains seven data items that describe community transmission levels across all counties. This dataset provides the same numbers used to show transmission maps on the COVID Data Tracker and contains reported daily transmission levels at the county level. The dataset is updated every day to include the most current day's data. The calculating procedures below are used to adjust the transmission level to low, moderate, considerable, or high.
Description: US State and County case counts and transmission level from 16-Aug-2021 to 03-Feb-2022
URL: https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-County-Level-of-Community-T/8396-v7yb
Data Size: 550,702 rows and 7 columns
**7 - World Cases & Vaccination Counts**
Source: This is an open-source dataset collected and maintained by Our World in Data. OWID provides research and data to make progress against the world’s largest problems.
Description: This dataset includes vaccinations, tests & positivity, hospital & ICU, confirmed cases, confirmed deaths, reproduction rate, policy responses and other variables of interest.
URL: https://github.com/owid/covid-19-data/tree/master/public/data
Data Size: 67 columns and 157,000 rows
**8 - COVID-19 Data in the European Union**
Source: This is an open-source dataset collected and maintained by ECDC. It is an EU agency aimed at strengthening Europe's defenses against infectious diseases.
Description: This dataset co...
The Open Source Intelligence (OSINT) Market was valued at USD 9.96 Billion in 2024 and is projected to reach USD 52.40 Billion by 2032, growing at a CAGR of 25.44% from 2026 to 2032.
Increasing Threats from Cybercrime and Terrorism: The escalating sophistication and frequency of cybercrime and terrorism represent a primary driver for the OSINT market. As malicious actors, from state-sponsored hackers to individual extremists, increasingly leverage the internet for planning, communication, and attacks, the need for proactive threat intelligence has never been more critical. OSINT provides a powerful solution by enabling organizations to monitor public forums, social media, and the dark web for early warning signs of malicious intent.
| Report Attribute | Details |
| --- | --- |
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2.69 (USD Billion) |
| MARKET SIZE 2025 | 2.92 (USD Billion) |
| MARKET SIZE 2035 | 6.5 (USD Billion) |
| SEGMENTS COVERED | Application, Deployment Type, Source Type, End Use, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | increased data availability, growing security concerns, rising demand for analytics, advancements in AI technology, regulatory compliance challenges |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Axciom, Dataminr, CrowdStrike, IntSights Cyber Intelligence, Recorded Future, Google, Palantir Technologies, Microsoft, Dark Owl, Hootsuite, Keyhole Software, Social Search, Clearview AI, IBM, Verint Systems, Fastly |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Increased demand for cybersecurity solutions, Enhanced data analytics capabilities, Growth in government intelligence applications, Rising interest in social media monitoring, Expansion of AI-driven OSINT tools |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 8.4% (2025 - 2035) |
The Carbon Storage Open Database is a collection of spatial data obtained from publicly available sources published by several NATCARB Partnerships and other organizations. The carbon storage open database was collected from open-source data on ArcREST servers and websites in 2018, 2019, 2021, and 2022. The original database was published on the former GeoCube, which is now EDX Spatial, in July 2020, and has since been updated with additional data resources from the Energy Data eXchange (EDX) and external public data resources. The shapefile geodatabase is available in total, and has also been split up into multiple databases based on the maps produced for EDX Spatial. These are topical map categories that describe the type of data, and sometimes the region to which the data relates. The data is separated in case there is only a specific area or data type that is of interest for download.
In addition to the geodatabases, this submission contains:
1. A ReadMe file describing the processing steps completed to collect and curate the data.
2. A data catalog of all feature layers within the database.
Additional published resources are available that describe the work done to produce the geodatabase:
Morkner, P., Bauer, J., Creason, C., Sabbatino, M., Wingo, P., Greenburg, R., Walker, S., Yeates, D., Rose, K. 2022. Distilling Data to Drive Carbon Storage Insights. Computers & Geosciences. https://doi.org/10.1016/j.cageo.2021.104945
Morkner, P., Bauer, J., Shay, J., Sabbatino, M., and Rose, K. An Updated Carbon Storage Open Database - Geospatial Data Aggregation to Support Scaling-Up Carbon Capture and Storage. United States: N. p., 2022. Web. https://www.osti.gov/biblio/1890730
Morkner, P., Rose, K., Bauer, J., Rowan, C., Barkhurst, A., Baker, D.V., Sabbatino, M., Bean, A., Creason, C.G., Wingo, P., and Greenburg, R. Tools for Data Collection, Curation, and Discovery to Support Carbon Sequestration Insights. United States: N. p., 2020. Web. https://www.osti.gov/biblio/1777195
Disclaimer: This project was funded by the United States Department of Energy, National Energy Technology Laboratory, in part, through a site support contract. Neither the United States Government nor any agency thereof, nor any of their employees, nor the support contractor, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
Privately owned public spaces, also known by the acronym POPS, are outdoor and indoor spaces provided for public enjoyment by private owners in exchange for bonus floor area or waivers, an incentive first introduced into New York City's zoning regulations in 1961. To find out more about POPS, visit the Department of City Planning's website at http://nyc.gov/pops. This database contains detailed information about each privately owned public space in New York City.
Data Source: Privately Owned Public Space Database (2018), owned and maintained by the New York City Department of City Planning and created in collaboration with Jerold S. Kayden and The Municipal Art Society of New York. All previously released versions of this data are available on the DCP Website: BYTES of the BIG APPLE. Current version: 25v2
Geostrat Report – The Sequence Stratigraphy and Sandstone Play Fairways of the Late Jurassic Humber Group of the UK Central Graben
This non-exclusive report was purchased by the NSTA from Geostrat as part of the Data Purchase tender process (TRN097012017) that was carried out during Q1 2017. The contents do not necessarily reflect the technical view of the NSTA but the report is being published in the interests of making additional sources of data and interpretation available for use by the wider industry and academic communities.
The Geostrat report provides stratigraphic analyses and interpretations of data from the Late Jurassic to Early Cretaceous Humber Group across the UK Central Graben and includes a series of depositional sequence maps for eight stratigraphic intervals. Stratigraphic interpretations and tops from 189 wells (up to Release 91) are also included in the report.
The outputs as published here include a full PDF report, ODM/IC .dat format sequence maps, and all stratigraphic tops (lithostratigraphy, ages, sequence stratigraphy) in .csv format for import into different interpretation platforms.
In addition, the NSTA has undertaken to provide the well tops, stratigraphic interpretations and sequence maps in shapefile format that is intended to facilitate the integration of these data into projects and data storage systems held by individual organisations who are using non-ESRI ArcGIS GIS software. As part of this process, the Geostrat well names have been matched as far as possible to the NSTA well names from the NSTA Offshore Wells shapefile (as provided on the NSTA’s Open Data website) and the original polygon files have been incorporated into an ArcGIS project. All the files within the GIS folder of this delivery have been created by the NSTA.
An ESRI ArcGIS version of this delivery, including geodatabases, layer files and map documents for well tops, stratigraphic interpretations and sequence maps is available on the NSTA’s Open Data website and is recommended for use with ArcGIS. All releases included in the Data Purchase tender process that have been made openly available are summarised in a mapping application available from the NSTA website. The application includes an area of interest outline for each of the products and an overview of which wellbores have been included in the products.
The global open source intelligence market size was valued at USD 7.74 billion in 2023 and is expected to increase to USD 42.08 billion by 2032, at a CAGR of 20.70%.
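As a quick arithmetic check (not part of the report itself), the quoted CAGR follows from the standard compound-growth formula applied to the 2023 and 2032 figures:

```python
# Standard CAGR formula: (end / start) ** (1 / years) - 1
start_value = 7.74    # USD billion, 2023
end_value = 42.08     # USD billion, 2032 (projected)
years = 2032 - 2023

cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.2%}")  # ~20.70%, matching the quoted figure
```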
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background: Critical care units (CCUs), with wide use of various monitoring devices, generate massive amounts of data. To utilize the valuable information from these devices, data are collected and stored using systems like the Clinical Information System (CIS), Laboratory Information Management System (LIMS), etc. These systems are proprietary in nature, allow limited access to their databases, and have vendor-specific clinical implementations. In this study we focus on developing an open source, web-based meta-data repository for CCUs representing a patient's stay with relevant details.
Methods: After developing the web-based open source repository, we analyzed prospective data from two sites over four months for data quality dimensions (completeness, timeliness, validity, accuracy and consistency), morbidity and clinical outcomes. We used a regression model to highlight the significance of practice variations linked with various quality indicators.
Results: Data dictionary (DD) with 1555 fields (89.6% categorical and 11.4% text fields) is presented to cover clinical workflow of a CCU. The overall quality of 1795 patient days data with respect to standard quality dimensions is 87%. The data exhibit 82% completeness, 97% accuracy, 91% timeliness and 94% validity in terms of representing CCU processes. The data scores only 67% in terms of consistency. Furthermore, quality indicators and practice variations are strongly correlated (p-value < 0.05).
Conclusion: This study documents a DD for standardized data collection in CCUs. It provides robust data and insights for audit purposes, and pathways for CCUs to target practice improvements leading to specific quality improvements.
[205+ Pages Report] The Global Open Source Intelligence Market is estimated to reach a value of USD 28.34 Billion by 2026, growing at a CAGR of 19.9% during 2021-2026.
Terms of use: https://www.icpsr.umich.edu/web/ICPSR/studies/37935/terms
This study provides an evidence-based understanding of etiological issues related to school shootings and rampage shootings. It created a national, open-source database that includes all publicly known shootings that resulted in at least one injury that occurred on K-12 school grounds between 1990 and 2016. The investigators sought to better understand the nature of the problem and clarify the types of shooting incidents occurring in schools, provide information on the characteristics of school shooters, and compare fatal shooting incidents to events where only injuries resulted to identify intervention points that could be exploited to reduce the harm caused by shootings. To accomplish these objectives, the investigators used quantitative multivariate and qualitative case-study research methods to document where and when school violence occurs, and highlight key incident- and perpetrator-level characteristics to help law enforcement and school administrators differentiate between the kinds of school shootings that exist, to further policy responses that are appropriate for individuals and communities.
License: https://spdx.org/licenses/CC0-1.0.html
Management of the COVID-19 pandemic has proven to be a significant challenge to policy makers. This is in large part due to uneven reporting and the absence of open-access visualization tools to present and analyze local trends as well as infer healthcare needs. Here we report the development of CovidCounties.org, an interactive web application that depicts daily disease trends at the level of US counties using time series plots and maps. This application is accompanied by a manually curated dataset that catalogs all major public policy actions made at the state-level, as well as technical validation of the primary data. Finally, the underlying code for the site is also provided as open source, enabling others to validate and learn from this work.
Methods: Data related to state-wide implementation of social-distancing policies were manually curated by web search and independently reviewed by a second author; disagreements were rare and resolved by discussion. Government websites were prioritized as sources of truth where feasible; otherwise, news reports covering state-wide proclamations were used. All citations are captured in the data file.
Ground truth data used in the validation were manually curated from states’ Department of Public Health websites. Citations of the validation data are included in the data file.
To confirm global accessibility of covidcounties.org, we used dareboost.com to perform loading speed tests from 14 cities across the globe using three different devices: Google Chrome via desktop, iPhone 6s/7/8, and Samsung Galaxy S6.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link); Resources 2 - 8 are generated using the Flatterer (external link) utility.
### Description of resources:
1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. Datastore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
8. Data Package Entity Relation Diagram displays the title and format for each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to Catalogue. It can be queried with database or analytical software tools for analysis.
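As a hedged sketch of how Resource 1 (the GZip-compressed JSON Lines file) could be consumed, the snippet below streams one record at a time; the file name catalogue.jsonl.gz is a placeholder for the actual download, and the printed keys are assumptions based on the usual CKAN package schema:

```python
import gzip
import json

# Placeholder name for the downloaded JSON Lines resource (Resource 1 above).
path = "catalogue.jsonl.gz"

# Each line is one Dataset / Open Information Record as nested JSON.
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        # 'id' and 'title' are typical CKAN package keys; adjust to the actual schema.
        print(record.get("id"), record.get("title"))
```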
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.
Purpose: As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals;
- compare how much data is in institutional vs. domain-specific vs. federal platforms;
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data;
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.
Approach: The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods: We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories.
Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation: We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results: A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
- Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset.
Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
We describe a unique, web-based data visualization portal developed for use by researchers and public transit agencies investigating future shared-taxi fleet scenarios. Augmenting or even replacing fixed-route transit lines with automated, connected, shared taxi fleets may be a desirable alternative in less-densely developed areas. The MATSim agent-based transport microsimulation model is used to study scenarios including the status quo, dynamically-dispatched fleets with drivers, and fully autonomous fleets. This paper focuses on a data visualization portal which includes many interactive views such as agent (taxi) movements color-coded by number of passengers and trip request origins and destinations, changes in roadway and passenger volumes compared to a base case, and more. The agent-based simulation covers a 24 hour simulation period; analysts can hone in on specific times of day to examine, e.g. school pickup/drop-offs or commute trips connecting to rail stations. The tool is in operation for several small cities and rural regions in Germany and was successfully used as an outreach tool in public meetings. In addition, developers of the MATSim DRT extension found the visualizations particularly useful for debugging both the algorithms and the scenario definitions. The code is entirely open source and, while this specific study has a rather esoteric use case, the visualization platform has an extensible design that could be modified for other purposes.
This layer serves as the authoritative geographic data source for California's K-12 public school locations during the 2024-25 academic year. Schools are mapped as point locations and assigned coordinates based on the physical address of the school facility. The school records are enriched with additional demographic and performance variables from the California Department of Education's data collections. These data elements can be visualized and examined geographically to uncover patterns, solve problems and inform education policy decisions.
The schools in this file represent a subset of all records contained in the CDE's public school directory database. This subset is restricted to TK-12 public schools that were open in October 2024 to coincide with the official 2024-25 student enrollment counts collected on Fall Census Day in 2024 (first Wednesday in October). This layer also excludes nonpublic nonsectarian schools and district office schools.
The CDE's California School Directory provides school location and other basic school characteristics found in the layer's attribute table. The school enrollment, demographic and program data are collected by the CDE through the California Longitudinal Pupil Achievement Data System (CALPADS) and can be accessed as publicly downloadable files from the Data & Statistics web page on the CDE website.
Schools are assigned X, Y coordinates using a quality controlled geocoding and validation process to optimize positional accuracy. Most schools are mapped to the school structure or centroid of the school property parcel and are individually verified using aerial imagery or assessor's parcels databases. Schools are assigned various geographic area values based on their mapped locations including state and federal legislative district identifiers and National Center for Education Statistics (NCES) locale codes.
Phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of genes and gene families, including the identification of ancestral gene duplication events as well as regions under positive or purifying selection within lineages. Gene family and orthogroup characterization enables the identification of syntenic blocks, which can then be visualized with various tools. Unfortunately, currently available tools display only an overview of syntenic regions as a whole, limited to the gene level, and none provide further details about structural changes within genes, such as the conservation of ancestral exon boundaries amongst multiple genomes. We present Aequatus, an open-source web-based tool that provides an in-depth view of gene structure across gene families, with various options to render and filter visualizations. It relies on precalculated alignment and gene feature information typically held in, but not limited to, the Ensembl Compara and Core databases. We also offer Aequatus.js, a reusable JavaScript module that fulfills the visualization aspects of Aequatus, available within the Galaxy web platform as a visualization plug-in, which can be used to visualize gene trees generated by the GeneSeqToFamily workflow.
This is the dataset used in the respective research work. The abstract is available below.
If you want to cite this work, please use:
Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German and Lefteris Angelis, What do developers talk about open source software licensing?, to appear in the Proceedings of the Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020.
Free and open source software has gained a lot of momentum in the industry and the research community. Open source licenses determine the rules under which the open source software can be further used and distributed. Previous works have examined the usage of open source licenses in the framework of specific projects or online social coding platforms, examining developers' specific licensing views for specific software. However, the questions practitioners ask about licenses and licensing, as captured in Question and Answer websites, also constitute an important aspect toward understanding practitioners' general license and licensing concerns. In this paper, we investigate open source license discussions using data from the Software Engineering, Open Source and Law Stack Exchange sites that contain relevant data. We describe the process used for the data collection and analysis, and discuss the main results. Our results indicate that clarifications about specific licenses and specific license terms are required. The results can be useful for developers, educators and license authors.