Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Image generated by DALL-E. See prompt for more details
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
105,851 records partitioned into 100,000 train and 5,851 test records
~23M total tokens, including ~12M SQL tokens
Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".
Lucerne University of Applied Sciences and Arts
Master of Science in Applied Information and Data Science (MScIDS)
Autumn Semester 2022
Change log, Version 1.1:
The following SQL scripts were added:

Index | Type | Name |
---|---|---|
1 | View | pg.dictionary_table |
2 | View | pg.dictionary_column |
3 | View | pg.dictionary_relation |
4 | View | pg.accesslayer_table |
5 | View | pg.accesslayer_column |
6 | View | pg.accesslayer_relation |
7 | View | pg.accesslayer_fact_candidate |
8 | Stored Procedure | pg.get_fact_candidate |
9 | Stored Procedure | pg.get_dimension_candidate |
10 | Stored Procedure | pg.get_columns |
The scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. The object scripts of the sample data warehouse itself are restricted and cannot be shared.
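The following sketch shows roughly how a dictionary view such as pg.dictionary_column could be built on SQL Server 2017 from the system catalog views; it is not taken from the scripts listed above, and the selected columns are an assumption made for illustration.

-- Hedged sketch only: illustrates deriving a data dictionary from the SQL Server
-- system catalog; the actual pg.dictionary_column shipped with the dataset may differ.
CREATE VIEW pg.dictionary_column AS
SELECT
    s.name        AS schema_name,
    t.name        AS table_name,
    c.name        AS column_name,
    ty.name       AS data_type,
    c.max_length  AS max_length,
    c.is_nullable AS is_nullable
FROM sys.columns c
JOIN sys.tables  t  ON t.object_id     = c.object_id
JOIN sys.schemas s  ON s.schema_id     = t.schema_id
JOIN sys.types   ty ON ty.user_type_id = c.user_type_id;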
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample Library Carpentry SQL lesson database created from the Directory of Open Access Journals (DOAJ) data. The sample SQL database contains tables: articles, journals, languages, licences, and publishers. Previous version of the sample SQL database: Staiger, Christine (2016): LC-articles. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3409471.v3
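As an illustration of the kind of query the lesson database supports, the sketch below counts articles per journal; the join key (ISSNs) and the column names follow the Library Carpentry lesson schema and are assumptions to verify against the actual tables.

-- Illustrative only: column names (ISSNs, Journal_Title) are assumed from the lesson schema.
SELECT j.Journal_Title, COUNT(*) AS n_articles
FROM articles a
JOIN journals j ON j.ISSNs = a.ISSNs
GROUP BY j.Journal_Title
ORDER BY n_articles DESC
LIMIT 10;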
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool was used.
NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
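Conceptually, a flow groups packets that share a common key (source and destination address, source and destination port, and protocol). The query below is only a didactic sketch of that grouping over a hypothetical packets table; in DOROTHEA the aggregation is performed by the ipt_netflow kernel module, not by SQL.

-- Didactic sketch: the packets table and its columns are hypothetical.
SELECT src_ip, dst_ip, src_port, dst_port, protocol,
       COUNT(*)   AS n_packets,
       SUM(bytes) AS total_bytes,
       MIN(ts)    AS first_seen,
       MAX(ts)    AS last_seen
FROM packets
GROUP BY src_ip, dst_ip, src_port, dst_port, protocol;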
Datasets
The first dataset (D1) was collected to train the detection models, and a second dataset (D2) was collected using different attacks than those used in training, in order to test the models and ensure their generalization.
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset | Aim | Samples | Benign-malicious traffic ratio |
---|---|---|---|
D1 | Training | 400,003 | 50% |
D2 | Test | 57,239 | 50% |
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and export a flow after it has been inactive for 15 seconds or after it has been active for 1800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.
Parameters | Description |
---|---|
--banner, --current-user, --current-db, --hostname, --is-dba, --users, --passwords, --privileges, --roles, --dbs, --tables, --columns, --schema, --count, --dump, --comments | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
--level=5 | Increase the probability of a false positive identification |
--risk=3 | Increase the probability of extracting data |
--random-agent | Select the User-Agent randomly |
--batch | Never ask for user input, use the default behavior |
--answers="follow=Y" | Predefined answers to yes |
Each node executed SQLIAs against 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MySQL or SQL Server database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIAs were performed using Union attacks against the MySQL and SQL Server databases.
For D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1: in D2, the address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.
For the MySQL server, MariaDB version 10.4.12 was used; the other engines were Microsoft SQL Server 2017 Express and PostgreSQL version 13.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.
analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
download the fixed-width file containing household, family, and person records
import by separating this file into three tables, then merge 'em together at the person-level
download the fixed-width file containing the person-level replicate weights
merge the rectangular person-level file with the replicate weights, then store it in a sql database
create a new variable - one - in the data table

2012 asec - analysis examples.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
perform a boatload of analysis examples

replicate census estimates - 2011.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
match the sas output shown in the png file below
2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
the census bureau's current population survey page
the bureau of labor statistics' current population survey page
the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Summary
The duckdb-text2sql-25k dataset contains 25,000 DuckDB text-2-sql pairs covering diverse aspects of DuckDB's SQL syntax. We synthesized this dataset using Mixtral 8x7B, based on DuckDB's v0.9.2 documentation and Spider schemas that were translated to DuckDB syntax and enriched with nested type columns. Each training sample consists of a natural language prompt, a corresponding (optional) schema, and a resulting query. Each pair furthermore has a category property… See the full description on the dataset page: https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k.
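For context, the pairs exercise DuckDB-specific constructs such as nested LIST and STRUCT columns; the query below is a hypothetical example of that style (the table and columns are invented, not drawn from the dataset).

-- Hypothetical DuckDB query over a table with nested types (not from the dataset).
SELECT o.customer.name                         AS customer_name,  -- STRUCT field access
       len(o.items)                            AS n_items,        -- LIST length
       list_filter(o.items, x -> x.price > 10) AS pricey_items    -- lambda over a LIST
FROM orders o;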
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt). See more description at the related Figshare #datavis record.

Versions:
V5: Following the #datascience course, I have made the main data (individual salary and wages) available as csv and a Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows. Also provided Jobs and SA4 (Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.

#dataTotals - Salary and Wages
Year | Workers (M) | Earnings ($B) |
---|---|---|
2002 | 8.5 | 285 |
2006 | 9.4 | 372 |
2010 | 10.2 | 481 |
2014 | 10.3 | 584 |

#dataTotal - Sole Traders: Year, Workers (M), Sales ($B), Earnings ($B): 2002 0.96113; 2006 1.08819; 2010 1.111226; 2014 1.19630

#links
See ATO request for data at the ideascale link below.
See the original csv open data set (CC-BY) at the data.gov.au link below.
This database was used to create maps of change in regional employment - see the Figshare link below (m9.figshare.4056282).

#package
This file package contains a database (analysing the open data) as an SQL dump and sample SQL text interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.

#analysis
The database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130 (offline). This is only resourced for a maximum of 1 year, from July 2016, so will expire in June 2017. Hence the filing here. The sample home page is provided here (and pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL. Nectar URL now offline - server files attached as a package (html_backup[date].zip), including php scripts, html, csv, jpegs.

#install
IMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)
1. Start MAMP on OSX.
1.1 Go to PhpMyAdmin.
2. New Database.
3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on a MacBook Pro, 16Gb, 2.3 GHz i5).
4. Four tables appear: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rows, plus views e.g. deltahair, Industrycodes, states.
5. Run the test query under #sampleSQL: Sum of Salary by SA4, e.g. 101 $4.7B, 102 $6.9B.

#sampleSQL
select sa4,
(select sum(count) from salaryWages where year = '2014' and sa4 = sw.sa4) as thisYr14,
(select sum(count) from salaryWages where year = '2010' and sa4 = sw.sa4) as thisYr10,
(select sum(count) from salaryWages where year = '2006' and sa4 = sw.sa4) as thisYr06,
(select sum(count) from salaryWages where year = '2002' and sa4 = sw.sa4) as thisYr02
from salaryWages sw
group by sa4
order by sa4
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global SQL Server Monitoring Tools market will be USD XX million in 2023 and will expand at a compound annual growth rate (CAGR) of 5.50% from 2023 to 2030.
North America held the largest share of the market, accounting for more than 40% of global revenue. It will grow at a compound annual growth rate (CAGR) of 3.7% from 2023 to 2030.
The Europe SQL Server Monitoring Tools market is projected to expand at a compound annual growth rate (CAGR) of 5.50% from 2023 to 2030; Europe accounted for a share of over 30% of global revenue.
Asia Pacific held more than 23% of global revenue and will grow at a compound annual growth rate (CAGR) of 7.5% from 2023 to 2030.
The Latin America market holds more than 5% of global revenue. It will grow at a compound annual growth rate (CAGR) of 4.9% from 2023 to 2030.
The Middle East and Africa held more than 3% of global revenue and will grow at a compound annual growth rate (CAGR) of 5.2% from 2023 to 2030.
The demand for SQL Server Monitoring Tools is rising due to the increasing complexity and volume of data.
Demand for the Web segment remains higher in the SQL Server Monitoring Tools market.
The consumer and retail category held the highest SQL Server Monitoring Tools market revenue share in 2023.
Increasing Complexity and Volume of Data to Provide Viable Market Output
In today's data-intensive market, enterprises must deal with massive data quantities, which strains SQL Server performance. To solve this difficulty, monitoring solutions have become essential for guaranteeing the proper operation and availability of crucial workloads. These technologies monitor database performance parameters in real-time, finding bottlenecks and optimizing queries to improve overall system efficiency. Organizations may reduce performance concerns, avoid downtime, and ensure database dependability by proactively monitoring SQL Server environments. As a result, SQL Server monitoring solutions play an important role in assisting businesses as they traverse the complexity of maintaining and extracting value from large amounts of information.
Digital Transformation to Propel Market Growth
The growing reliance on digital services and apps has increased the need for performance monitoring and uptime technologies. Maintaining consistent performance becomes critical as businesses rely more on digital platforms for operations, customer interactions, and data management. Real-time monitoring, optimization, and troubleshooting tools are critical for avoiding disruptions and downtime while providing a consistent user experience. This increased demand reflects a growing realization of the vital role that digital services play in modern operations, prompting organizations to invest in solutions that ensure the performance and availability of their digital infrastructure.
Market Restraints of the SQL Server Monitoring Tools
High Cost to Restrict Market Growth
Monitoring tool adoption and maintenance costs can be prohibitively expensive for smaller enterprises. While these technologies are critical for guaranteeing optimal system performance, smaller companies' financial constraints may limit their use. The initial setup costs, recurring license fees, and the need for qualified personnel to manage and interpret monitoring data can all burden tight budgets. As a result, smaller firms may need to carefully consider cost-effective alternatives or alternate techniques to overcome these constraints while still providing important monitoring capabilities without jeopardizing their financial stability.
Impact of COVID-19 on the SQL Server Monitoring Tools Market
COVID-19 had a dual impact on the market for SQL Server Monitoring Tools. On the one hand, growing remote work highlighted the significance of robust database monitoring for dispersed systems, driving up demand. On the other hand, economic uncertainty prompted some enterprises to reconsider investments, influencing purchasing decisions. The requirement for efficient database management, particularly in remote operations, fostered market resilience. Adaptable tools to manage performance difficulties were critical, reflecting a market dynamic in which the pandemic increased the adoption of monitoring solutions while influencing decision-making based on economic restrictions.
Introduction of SQL Server Monitoring Tools
The SQL Serv...
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the american community survey (acs) with r and monetdb. experimental. think of the american community survey (acs) as the united states' census for off-years - the ones that don't end in zero. every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it's not a survey). the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it's often accurate at sub-state geographies and - depending how many years pooled - down to small counties. households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey. this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure. the census bureau runs and finances this behemoth, of course.

the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files. merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file. for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files. use as few pooled years as you can, unless you like sentences that start with, "over the period of 2006 - 2010, the average american ... [insert yer findings here]."

rather than processing the acs public use microdata sample line-by-line, the r language brazenly reads everything into memory by default. to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package principally to deal with this ram-gobbling monster. if you're already familiar with syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn't behave as you expect it to - some sqlsurvey commands require a different structure (i.e. svyby gets called through svymean) and others might not exist anytime soon (like svyolr). gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this acs code. monetdb imports, writes, recodes data slowly, but reads it hyper-fast. a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat. importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding.

the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat.

this new github repository contains four scripts:

2005-2011 - download all microdata.R
create the batch (.bat) file needed to initiate the monet database in the future
download, unzip, and import each file for every year and size specified by the user
create and save household- and merged/person-level replicate weight complex sample designs
create a well-documented block of code to re-initiate the monet db server in the future
fair warning: this full script takes a loooong time. run it friday afternoon, commune with nature for the weekend, and if you've got a fast processor and speedy internet connection, monday morning it should be ready for action. otherwise, either download only the years and sizes you need or - if you gotta have 'em all - run it, minimize it, and then don't disturb it for a week.

2011 single-year - analysis examples.R
run the well-documented block of code to re-initiate the monetdb server
load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
perform the standard repertoire of analysis examples, only this time using sqlsurvey functions

2011 single-year - variable recode example.R
run the well-documented block of code to re-initiate the monetdb server
copy the single-year 2011 table to maintain the pristine original
add a new age category variable by hand
add a new age category variable systematically
re-create then save the sqlsurvey replicate weight complex sample design on this new table
close everything, then load everything back up in a fresh instance of r
replicate a few of the census statistics. no muss, no fuss

replicate census estimates - 2011.R
run the well-documented block of code to re-initiate the monetdb server
load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
match every nationwide statistic on the census bureau's estimates page, using sqlsurvey functions

click here to view these four scripts

for more detail about the american community survey (acs), visit: the us census...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication package for the paper:
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622
Generating the paper
The paper can be generated using the following command:
guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make
This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.
It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):
GNU Make
SQLite 3
GNU AWK
Rubber
Graphviz
TeXLive
Structure
data/ contains the data examined in the paper
scripts/ contains dedicated code for the paper
logs/ contains logs generated during certain computations
Preservation of Guix
Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.
The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.
Analysis
Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.
The pog-types.sql query gives the counts of each source type (e.g. "git" or "tar-gz") for each commit covered by the database.
The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.
The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.
Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
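As a rough picture of what a query like pog-types.sql computes, the sketch below counts sources per commit and per type; the table and column names here are assumptions made for illustration, not the actual PoG schema (see data/schema.sql for that).

-- Assumed schema for illustration only; consult data/schema.sql for the real one.
SELECT c.commit_hash,
       s.type   AS source_type,
       COUNT(*) AS n_sources
FROM commits c
JOIN commit_sources cs ON cs.commit_id = c.id
JOIN sources s         ON s.id = cs.source_id
GROUP BY c.commit_hash, s.type;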
Estimating missing sources
The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix's Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.
A naïve search of Git history results in an overestimate due to Guix's branch development model: we find hashes that were never exposed to users of "guix pull". To work around this, we also approximate the history of commits available to "guix pull". We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.
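One way to picture the comparison step is the query below, which assumes the hashes recovered from Git history have been loaded into a helper table; it only illustrates the set difference being computed, since the actual pipeline performs this through the scripts driven by the Makefile.

-- Illustration only: git_history_hashes is a hypothetical helper table, and the
-- sources table and its hash column are assumptions about the PoG schema.
SELECT g.hash
FROM git_history_hashes g
LEFT JOIN sources s ON s.hash = g.hash
WHERE s.hash IS NULL;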
This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.
To generate the estimate, use:
guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make data/missing-sources.txt
If not using Guix, you will need additional software beyond what is used to generate the paper:
GNU Guile
GNU Bash
GNU Mailutils
GNU Parallel
Measuring link rot
In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the very specific moment when they ran. Hence, it is not possible to reproduce the exact same outputs. However, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance,
guix repl -q scripts/table-per-origin.scm
When running these scripts for the paper, we tracked their output and saved it inside the logs directory.
This dataset is a 10% repo-sampled dataset for selected languages. We applied a repo sample rate of 10%; e.g., if the sample rate is 10%, then we take 10% of all repos for a
given language but include all files inside each sampled repo.
This was generated using our codecomplete/training/completions/datagen
./launch.sh
--dataset-name bigcode/starcoderdata
--subset c,cpp,go,java,javascript,typescript,python,ruby,scala,sql
--sample-rate 0.01
--hf-token
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DACOS - DAtaset of COde Smells
The dataset offers annotated code snippets for three code smells: multifaceted abstraction, complex method, and long parameter list.
In addition to a manually annotated dataset of potentially subjective snippets, we offer a larger set of snippets that are either definitely benign or definitely smelly.
The upload contains three files:
Required Software
The dataset is created in MySQL. Hence a local or remote installation of MySQL is needed with privileges to create and modify schemas.
Importing the Dataset
The dataset is a self-contained SQL file. To import the dataset, run the following command:
mysql -u username -p database_name < DACOSMain.sql
mysql -u username -p database_name < DACOSExtended.sql
Understanding the Datasets
The two datasets differ in architecture. The main dataset contains a table named annotations that holds every annotation collected from users. The sample table contains the samples presented to users for annotation. The class_metrics and method_metrics tables contain class and method metrics, respectively; these were used to filter samples that are likely to contain smells and hence can be shown to users.
The extended dataset is created by selecting samples that are below or above the selected metric range for each smell. Hence, these samples are definitely smelly or benign. The extended version of the dataset does not contain a table for annotations, since its samples were not presented to users. It instead has an 'entry' table where each sample is classified according to the smell it contains. The codes for identifying smells are listed below, and an example query is sketched after the table:
Condition | Smell ID |
---|---|
Multifaceted Abstraction Present | 1 |
Multifaceted Abstraction not detected | 4 |
Long Parameter List Present | 2 |
Long Parameter List Absent | 5 |
Complex Method Present | 3 |
Complex Method Absent | 6 |
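For example, the sketch below retrieves all samples classified as Complex Method (smell id 3) from the extended dataset; the column names of the entry table are assumptions and should be checked against the imported schema.

-- Column names are assumptions; verify against the imported DACOSExtended schema.
SELECT e.sample_id, e.smell_id
FROM entry e
WHERE e.smell_id = 3;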
The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.
Data Limitations:
Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved in the future as state systems transition to Fi$cal.
Data Collection Methodology:
The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
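The join described above can be pictured roughly as in the sketch below; all table and column names are assumptions for illustration, since the internal schema of the SQL Server database is not published with this dataset.

-- Rough sketch of the described join; every name here is an assumption.
SELECT po.purchase_order_number,
       s.supplier_name,
       d.department_name,
       u.commodity_title
FROM PurchaseOrder po
JOIN Supplier   s ON s.supplier_id   = po.supplier_id
JOIN Department d ON d.department_id = po.department_id
JOIN UNSPSC     u ON u.unspsc_code   = po.unspsc_code;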
Secondary/Related Resources:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page contains the i) SQLite database, and ii) scripts and instructions for the paper titled Opening the Valve on Pure-Data: Usage Patterns and Programming Practices of a Data-Flow Based Visual Programming Language.
We have provided two main files in this link:
Additionally, the i) SQLite database, ii) scripts and instructions, and iii) mirrored repositories of the PD projects can also be found in the following link: https://archive.org/details/Opening_the_Valve_on_Pure_Data.
The download instructions are as follows:
tar -xzf dataset.tar.gz
unzip scripts_and_instructions.zip
wget -c https://archive.org/download/Opening_the_Valve_on_Pure_Data/pd_mirrored.tar.gz
After that, you can unzip the file using tar -xzf pd_mirrored.tar.gz
You can find a README.md file inside the unzipped directory titled scripts_and_instructions detailing the structure and usage of our dataset, along with some sample SQL queries and additional helper scripts for the database. Furthermore, we have provided instructions for replicating our work in the same README file.
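As a flavor of the sample queries mentioned in the README, the sketch below counts Pure Data patch files per project; the table and column names are guesses made for illustration, and the real schema is the one documented in README.md.

-- Table and column names are guesses; see README.md for the actual schema.
SELECT p.name      AS project_name,
       COUNT(f.id) AS n_pd_files
FROM projects p
JOIN files f ON f.project_id = p.id
WHERE f.path LIKE '%.pd'
GROUP BY p.name
ORDER BY n_pd_files DESC
LIMIT 10;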
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4: Table S4. The STRING Function analysis with Chi Square (χ²) values greater than nine (9) at 1 degree of freedom.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2: Table S2. The tryptic phosphopeptides with Chi square FDR q-values.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ICoastalDB, which was developed using Microsoft SQL Server, consists of water quality and related data in the Illinois coastal zone that were collected by various organizations. The information in the dataset includes, but is not limited to, sample data type, method of data sampling, location, time and date of sampling, and data units.
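A typical retrieval from such a water-quality database might look like the sketch below; the table and column names are assumptions for illustration only, since the schema is not reproduced in this description.

-- T-SQL sketch; every name here is an assumption about the ICoastalDB schema.
SELECT sample_location, sample_date, parameter_name, result_value, units
FROM WaterQualitySample
WHERE parameter_name = 'Dissolved oxygen'
ORDER BY sample_date;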
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Stack Overflow manual study results for the paper "An Empirical Study on the Challenges that Developers Encounter When Developing Apache Spark Applications".
A filtered version of the dataset at seeklhy/SynSQL-2.5M with the data transformed to Apache Arrow and filtered as follows:
Only samples that could be joined with a valid or non-empty database schema have been retained. Only samples whose joined schema was parsable with a tree-sitter SQL parser without errors have been retained.