Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Image generated by DALL-E. See prompt for more details
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
105,851 records partitioned into 100,000 train and 5,851 test records
~23M total tokens, including ~12M SQL tokens
Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".
Lucerne University of Applied Sciences and Arts
Master of Science in Applied Information and Data Science (MScIDS)
Autumn Semester 2022
Change log, Version 1.1:
The following SQL scripts were added:

Index | Type | Name |
---|---|---|
1 | View | pg.dictionary_table |
2 | View | pg.dictionary_column |
3 | View | pg.dictionary_relation |
4 | View | pg.accesslayer_table |
5 | View | pg.accesslayer_column |
6 | View | pg.accesslayer_relation |
7 | View | pg.accesslayer_fact_candidate |
8 | Stored Procedure | pg.get_fact_candidate |
9 | Stored Procedure | pg.get_dimension_candidate |
10 | Stored Procedure | pg.get_columns |
The scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. The object scripts of the sample data warehouse itself are restricted and cannot be shared.
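The following sketch shows roughly how a dictionary view such as pg.dictionary_column could be built on SQL Server 2017 from the system catalog views; it is not taken from the scripts listed above, and the selected columns are an assumption made for illustration.

-- Hedged sketch only: illustrates deriving a data dictionary from the SQL Server
-- system catalog; the actual pg.dictionary_column shipped with the dataset may differ.
CREATE VIEW pg.dictionary_column AS
SELECT
    s.name        AS schema_name,
    t.name        AS table_name,
    c.name        AS column_name,
    ty.name       AS data_type,
    c.max_length  AS max_length,
    c.is_nullable AS is_nullable
FROM sys.columns c
JOIN sys.tables  t  ON t.object_id     = c.object_id
JOIN sys.schemas s  ON s.schema_id     = t.schema_id
JOIN sys.types   ty ON ty.user_type_id = c.user_type_id;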
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample Library Carpentry SQL lesson database created from the Directory of Open Access Journals (DOAJ) data. The sample SQL database contains tables: articles, journals, languages, licences, and publishers. Previous version of the sample SQL database: Staiger, Christine (2016): LC-articles. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3409471.v3
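As an illustration of the kind of query the lesson database supports, the sketch below counts articles per journal; the join key (ISSNs) and the column names follow the Library Carpentry lesson schema and are assumptions to verify against the actual tables.

-- Illustrative only: column names (ISSNs, Journal_Title) are assumed from the lesson schema.
SELECT j.Journal_Title, COUNT(*) AS n_articles
FROM articles a
JOIN journals j ON j.ISSNs = a.ISSNs
GROUP BY j.Journal_Title
ORDER BY n_articles DESC
LIMIT 10;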
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool was used.
NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
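Conceptually, a flow groups packets that share a common key (source and destination address, source and destination port, and protocol). The query below is only a didactic sketch of that grouping over a hypothetical packets table; in DOROTHEA the aggregation is performed by the ipt_netflow kernel module, not by SQL.

-- Didactic sketch: the packets table and its columns are hypothetical.
SELECT src_ip, dst_ip, src_port, dst_port, protocol,
       COUNT(*)   AS n_packets,
       SUM(bytes) AS total_bytes,
       MIN(ts)    AS first_seen,
       MAX(ts)    AS last_seen
FROM packets
GROUP BY src_ip, dst_ip, src_port, dst_port, protocol;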
Datasets
The first dataset (D1) was collected to train the detection models, and a second dataset (D2) was collected using different attacks than those used in training, in order to test the models and ensure their generalization.
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset | Aim | Samples | Benign-malicious traffic ratio |
---|---|---|---|
D1 | Training | 400,003 | 50% |
D2 | Test | 57,239 | 50% |
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and export a flow after it has been inactive for 15 seconds or after it has been active for 1800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.
Parameters | Description |
---|---|
--banner, --current-user, --current-db, --hostname, --is-dba, --users, --passwords, --privileges, --roles, --dbs, --tables, --columns, --schema, --count, --dump, --comments | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
--level=5 | Increase the probability of a false positive identification |
--risk=3 | Increase the probability of extracting data |
--random-agent | Select the User-Agent randomly |
--batch | Never ask for user input, use the default behavior |
--answers="follow=Y" | Predefined answers to yes |
Each node executed SQLIAs against 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MySQL or SQL Server database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIAs were performed using Union attacks against the MySQL and SQL Server databases.
For D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1: in D2, the address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.
For the MySQL server, MariaDB version 10.4.12 was used; the other engines were Microsoft SQL Server 2017 Express and PostgreSQL version 13.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.
analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
download the fixed-width file containing household, family, and person records
import by separating this file into three tables, then merge 'em together at the person-level
download the fixed-width file containing the person-level replicate weights
merge the rectangular person-level file with the replicate weights, then store it in a sql database
create a new variable - one - in the data table

2012 asec - analysis examples.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
perform a boatload of analysis examples

replicate census estimates - 2011.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
match the sas output shown in the png file below
2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
the census bureau's current population survey page
the bureau of labor statistics' current population survey page
the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Summary
The duckdb-text2sql-25k dataset contains 25,000 DuckDB text-2-sql pairs covering diverse aspects of DuckDB's SQL syntax. We synthesized this dataset using Mixtral 8x7B, based on DuckDB's v0.9.2 documentation and Spider schemas that were translated to DuckDB syntax and enriched with nested type columns. Each training sample consists of a natural language prompt, a corresponding (optional) schema, and a resulting query. Each pair furthermore has a category property… See the full description on the dataset page: https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k.
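For context, the pairs exercise DuckDB-specific constructs such as nested LIST and STRUCT columns; the query below is a hypothetical example of that style (the table and columns are invented, not drawn from the dataset).

-- Hypothetical DuckDB query over a table with nested types (not from the dataset).
SELECT o.customer.name                         AS customer_name,  -- STRUCT field access
       len(o.items)                            AS n_items,        -- LIST length
       list_filter(o.items, x -> x.price > 10) AS pricey_items    -- lambda over a LIST
FROM orders o;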
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt). See more description at the related Figshare #datavis record.

Versions:
V5: Following the #datascience course, I have made the main data (individual salary and wages) available as csv and a Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows. Also provided Jobs and SA4 (Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.

#dataTotals - Salary and Wages
Year | Workers (M) | Earnings ($B) |
---|---|---|
2002 | 8.5 | 285 |
2006 | 9.4 | 372 |
2010 | 10.2 | 481 |
2014 | 10.3 | 584 |

#dataTotal - Sole Traders: Year, Workers (M), Sales ($B), Earnings ($B): 2002 0.96113; 2006 1.08819; 2010 1.111226; 2014 1.19630

#links
See ATO request for data at the ideascale link below.
See the original csv open data set (CC-BY) at the data.gov.au link below.
This database was used to create maps of change in regional employment - see the Figshare link below (m9.figshare.4056282).

#package
This file package contains a database (analysing the open data) as an SQL dump and sample SQL text interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.

#analysis
The database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130 (offline). This is only resourced for a maximum of 1 year, from July 2016, so will expire in June 2017. Hence the filing here. The sample home page is provided here (and pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL. Nectar URL now offline - server files attached as a package (html_backup[date].zip), including php scripts, html, csv, jpegs.

#install
IMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)
1. Start MAMP on OSX.
1.1 Go to PhpMyAdmin.
2. New Database.
3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on a MacBook Pro, 16Gb, 2.3 GHz i5).
4. Four tables appear: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rows, plus views e.g. deltahair, Industrycodes, states.
5. Run the test query under #sampleSQL: Sum of Salary by SA4, e.g. 101 $4.7B, 102 $6.9B.

#sampleSQL
select sa4,
(select sum(count) from salaryWages where year = '2014' and sa4 = sw.sa4) as thisYr14,
(select sum(count) from salaryWages where year = '2010' and sa4 = sw.sa4) as thisYr10,
(select sum(count) from salaryWages where year = '2006' and sa4 = sw.sa4) as thisYr06,
(select sum(count) from salaryWages where year = '2002' and sa4 = sw.sa4) as thisYr02
from salaryWages sw
group by sa4
order by sa4
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global SQL Server Monitoring Tools market will be USD XX million in 2023 and will expand at a compound annual growth rate (CAGR) of 5.50% from 2023 to 2030.
North America held the largest share of the market, accounting for more than 40% of global revenue. It will grow at a compound annual growth rate (CAGR) of 3.7% from 2023 to 2030.
The Europe SQL Server Monitoring Tools market is projected to expand at a compound annual growth rate (CAGR) of 5.50% from 2023 to 2030; Europe accounted for a share of over 30% of global revenue.
Asia Pacific held more than 23% of global revenue and will grow at a compound annual growth rate (CAGR) of 7.5% from 2023 to 2030.
The Latin America market holds more than 5% of global revenue. It will grow at a compound annual growth rate (CAGR) of 4.9% from 2023 to 2030.
The Middle East and Africa held more than 3% of global revenue and will grow at a compound annual growth rate (CAGR) of 5.2% from 2023 to 2030.
The demand for SQL Server Monitoring Tools is rising due to the increasing complexity and volume of data.
Demand for the Web segment remains higher in the SQL Server Monitoring Tools market.
The consumer and retail category held the highest SQL Server Monitoring Tools market revenue share in 2023.
Increasing Complexity and Volume of Data to Provide Viable Market Output
In today's data-intensive market, enterprises must deal with massive data quantities, which strains SQL Server performance. To solve this difficulty, monitoring solutions have become essential for guaranteeing the proper operation and availability of crucial workloads. These technologies monitor database performance parameters in real-time, finding bottlenecks and optimizing queries to improve overall system efficiency. Organizations may reduce performance concerns, avoid downtime, and ensure database dependability by proactively monitoring SQL Server environments. As a result, SQL Server monitoring solutions play an important role in assisting businesses as they traverse the complexity of maintaining and extracting value from large amounts of information.
Digital Transformation to Propel Market Growth
The growing reliance on digital services and apps has increased the need for performance monitoring and uptime technologies. Maintaining consistent performance becomes critical as businesses rely more on digital platforms for operations, customer interactions, and data management. Real-time monitoring, optimization, and troubleshooting tools are critical for avoiding disruptions and downtime while providing a consistent user experience. This increased demand reflects a growing realization of the vital role that digital services play in modern operations, prompting organizations to invest in solutions that ensure the performance and availability of their digital infrastructure.
Market Restraints of the SQL Server Monitoring Tools
High Cost to Restrict Market Growth
Monitoring tool adoption and maintenance costs can be prohibitively expensive for smaller enterprises. While these technologies are critical for guaranteeing optimal system performance, smaller companies' financial constraints may limit their use. The initial setup costs, recurring license fees, and the need for qualified personnel to manage and interpret monitoring data can all burden tight budgets. As a result, smaller firms may need to carefully consider cost-effective alternatives or alternate techniques to overcome these constraints while still providing important monitoring capabilities without jeopardizing their financial stability.
Impact of COVID-19 on the SQL Server Monitoring Tools Market
COVID-19 had a dual impact on the market for SQL Server Monitoring Tools. On the one hand, growing remote work highlighted the significance of robust database monitoring for dispersed systems, driving up demand. On the other hand, economic uncertainty prompted some enterprises to reconsider investments, influencing purchasing decisions. The requirement for efficient database management, particularly in remote operations, fostered market resilience. Adaptable tools to manage performance difficulties were critical, reflecting a market dynamic in which the pandemic increased the adoption of monitoring solutions while influencing decision-making based on economic restrictions.
Introduction of SQL Server Monitoring Tools
The SQL Serv...
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the american community survey (acs) with r and monetdb. experimental. think of the american community survey (acs) as the united states' census for off-years - the ones that don't end in zero. every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it's not a survey). the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it's often accurate at sub-state geographies and - depending how many years pooled - down to small counties. households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey. this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure. the census bureau runs and finances this behemoth, of course.

the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files. merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file. for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files. use as few pooled years as you can, unless you like sentences that start with, "over the period of 2006 - 2010, the average american ... [insert yer findings here]."

rather than processing the acs public use microdata sample line-by-line, the r language brazenly reads everything into memory by default. to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package principally to deal with this ram-gobbling monster. if you're already familiar with syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn't behave as you expect it to - some sqlsurvey commands require a different structure (i.e. svyby gets called through svymean) and others might not exist anytime soon (like svyolr). gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this acs code. monetdb imports, writes, recodes data slowly, but reads it hyper-fast. a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat. importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding.

the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat.

this new github repository contains four scripts:

2005-2011 - download all microdata.R
create the batch (.bat) file needed to initiate the monet database in the future
download, unzip, and import each file for every year and size specified by the user
create and save household- and merged/person-level replicate weight complex sample designs
create a well-documented block of code to re-initiate the monet db server in the future
fair warning: this full script takes a loooong time. run it friday afternoon, commune with nature for the weekend, and if you've got a fast processor and speedy internet connection, monday morning it should be ready for action. otherwise, either download only the years and sizes you need or - if you gotta have 'em all - run it, minimize it, and then don't disturb it for a week.

2011 single-year - analysis examples.R
run the well-documented block of code to re-initiate the monetdb server
load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
perform the standard repertoire of analysis examples, only this time using sqlsurvey functions

2011 single-year - variable recode example.R
run the well-documented block of code to re-initiate the monetdb server
copy the single-year 2011 table to maintain the pristine original
add a new age category variable by hand
add a new age category variable systematically
re-create then save the sqlsurvey replicate weight complex sample design on this new table
close everything, then load everything back up in a fresh instance of r
replicate a few of the census statistics. no muss, no fuss

replicate census estimates - 2011.R
run the well-documented block of code to re-initiate the monetdb server
load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
match every nationwide statistic on the census bureau's estimates page, using sqlsurvey functions

click here to view these four scripts

for more detail about the american community survey (acs), visit: the us census...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication package for the paper:
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622
Generating the paper
The paper can be generated using the following command:
guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make
This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.
It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):
GNU Make
SQLite 3
GNU AWK
Rubber
Graphviz
TeXLive
Structure
data/ contains the data examined in the paper
scripts/ contains dedicated code for the paper
logs/ contains logs generated during certain computations
Preservation of Guix
Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.
The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.
Analysis
Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.
The pog-types.sql query gives the counts of each source type (e.g. "git" or "tar-gz") for each commit covered by the database.
The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.
The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.
Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
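As a rough picture of what a query like pog-types.sql computes, the sketch below counts sources per commit and per type; the table and column names here are assumptions made for illustration, not the actual PoG schema (see data/schema.sql for that).

-- Assumed schema for illustration only; consult data/schema.sql for the real one.
SELECT c.commit_hash,
       s.type   AS source_type,
       COUNT(*) AS n_sources
FROM commits c
JOIN commit_sources cs ON cs.commit_id = c.id
JOIN sources s         ON s.id = cs.source_id
GROUP BY c.commit_hash, s.type;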
Estimating missing sources
The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix's Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.
A naïve search of Git history results in an overestimate due to Guix's branch development model: we find hashes that were never exposed to users of "guix pull". To work around this, we also approximate the history of commits available to "guix pull". We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.
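One way to picture the comparison step is the query below, which assumes the hashes recovered from Git history have been loaded into a helper table; it only illustrates the set difference being computed, since the actual pipeline performs this through the scripts driven by the Makefile.

-- Illustration only: git_history_hashes is a hypothetical helper table, and the
-- sources table and its hash column are assumptions about the PoG schema.
SELECT g.hash
FROM git_history_hashes g
LEFT JOIN sources s ON s.hash = g.hash
WHERE s.hash IS NULL;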
This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.
To generate the estimate, use:
guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make data/missing-sources.txt
If not using Guix, you will need additional software beyond what is used to generate the paper:
GNU Guile
GNU Bash
GNU Mailutils
GNU Parallel
Measuring link rot
In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the very specific moment when they ran. Hence, it is not possible to reproduce the exact same outputs. However, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance,
guix repl -q scripts/table-per-origin.scm
When running these scripts for the paper, we tracked their output and saved it inside the logs directory.
This dataset is a 10% repo-sampled dataset for selected languages. We applied a repo sample rate of 10%; e.g., if the sample rate is 10%, then we take 10% of all repos for a
given language but include all files inside each sampled repo.
This was generated using our codecomplete/training/completions/datagen
./launch.sh
--dataset-name bigcode/starcoderdata
--subset c,cpp,go,java,javascript,typescript,python,ruby,scala,sql
--sample-rate 0.01
--hf-token
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DACOS - DAtaset of COde Smells
The dataset offers annotated code snippets for three code smells: multifaceted abstraction, complex method, and long parameter list.
In addition to a manually annotated dataset of potentially subjective snippets, we offer a larger set of snippets that are either definitely benign or definitely smelly.
The upload contains three files:
Required Software
The dataset is created in MySQL. Hence a local or remote installation of MySQL is needed with privileges to create and modify schemas.
Importing the Dataset
The dataset is a self-contained SQL file. To import the dataset, run the following command:
mysql -u username -p database_name < DACOSMain.sql
mysql -u username -p database_name < DACOSExtended.sql
Understanding the Datasets
The two datasets differ in architecture. The main dataset contains a table named annotations that holds every annotation collected from users. The sample table contains the samples presented to users for annotation. The class_metrics and method_metrics tables contain class and method metrics, respectively; these were used to filter samples that are likely to contain smells and hence can be shown to users.
The extended dataset is created by selecting samples that are below or above the selected metric range for each smell. Hence, these samples are definitely smelly or benign. The extended version of the dataset does not contain a table for annotations, since its samples were not presented to users. It instead has an 'entry' table where each sample is classified according to the smell it contains. The codes for identifying smells are listed below, and an example query is sketched after the table:
Condition | Smell ID |
---|---|
Multifaceted Abstraction Present | 1 |
Multifaceted Abstraction not detected | 4 |
Long Parameter List Present | 2 |
Long Parameter List Absent | 5 |
Complex Method Present | 3 |
Complex Method Absent | 6 |
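For example, the sketch below retrieves all samples classified as Complex Method (smell id 3) from the extended dataset; the column names of the entry table are assumptions and should be checked against the imported schema.

-- Column names are assumptions; verify against the imported DACOSExtended schema.
SELECT e.sample_id, e.smell_id
FROM entry e
WHERE e.smell_id = 3;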
The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.
Data Limitations:
Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved in the future as state systems transition to Fi$cal.
Data Collection Methodology:
The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
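The join described above can be pictured roughly as in the sketch below; all table and column names are assumptions for illustration, since the internal schema of the SQL Server database is not published with this dataset.

-- Rough sketch of the described join; every name here is an assumption.
SELECT po.purchase_order_number,
       s.supplier_name,
       d.department_name,
       u.commodity_title
FROM PurchaseOrder po
JOIN Supplier   s ON s.supplier_id   = po.supplier_id
JOIN Department d ON d.department_id = po.department_id
JOIN UNSPSC     u ON u.unspsc_code   = po.unspsc_code;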
Secondary/Related Resources:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page contains the i) SQLite database, and ii) scripts and instructions for the paper titled Opening the Valve on Pure-Data: Usage Patterns and Programming Practices of a Data-Flow Based Visual Programming Language.
We have provided two main files in this link:
Additionally, the i) SQLite database, ii) scripts and instructions, and iii) mirrored repositories of the PD projects can also be found in the following link: https://archive.org/details/Opening_the_Valve_on_Pure_Data.
The download instructions are as follows:
tar -xzf dataset.tar.gz
unzip scripts_and_instructions.zip
wget -c https://archive.org/download/Opening_the_Valve_on_Pure_Data/pd_mirrored.tar.gz
After that, you can unzip the file using tar -xzf pd_mirrored.tar.gz
You can find a README.md file inside the unzipped directory titled scripts_and_instructions detailing the structure and usage of our dataset, along with some sample SQL queries and additional helper scripts for the database. Furthermore, we have provided instructions for replicating our work in the same README file.
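As a flavor of the sample queries mentioned in the README, the sketch below counts Pure Data patch files per project; the table and column names are guesses made for illustration, and the real schema is the one documented in README.md.

-- Table and column names are guesses; see README.md for the actual schema.
SELECT p.name      AS project_name,
       COUNT(f.id) AS n_pd_files
FROM projects p
JOIN files f ON f.project_id = p.id
WHERE f.path LIKE '%.pd'
GROUP BY p.name
ORDER BY n_pd_files DESC
LIMIT 10;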
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4: Table S4. The STRING Function analysis with Chi Square (χ²) values greater than nine (9) at 1 degree of freedom.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2: Table S2. The tryptic phosphopeptides with Chi square FDR q-values.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ICoastalDB, which was developed using Microsoft SQL Server, consists of water quality and related data in the Illinois coastal zone that were collected by various organizations. The information in the dataset includes, but is not limited to, sample data type, method of data sampling, location, time and date of sampling, and data units.
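A typical retrieval from such a water-quality database might look like the sketch below; the table and column names are assumptions for illustration only, since the schema is not reproduced in this description.

-- T-SQL sketch; every name here is an assumption about the ICoastalDB schema.
SELECT sample_location, sample_date, parameter_name, result_value, units
FROM WaterQualitySample
WHERE parameter_name = 'Dissolved oxygen'
ORDER BY sample_date;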
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Stack Overflow manual study results for the paper "An Empirical Study on the Challenges that Developers Encounter When Developing Apache Spark Applications".
A filtered version of the dataset at seeklhy/SynSQL-2.5M with the data transformed to Apache Arrow and filtered as follows:
Only samples that could be joined with a valid or non-empty database schema have been retained. Only samples whose joined schema was parsable with a tree-sitter SQL parser without errors have been retained.