34 datasets found
  1. synthetic_text_to_sql

    • huggingface.co
    Cite
    Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description


    gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

    105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.

  2. Sample Dataset - HR Subject Areas

    • data.niaid.nih.gov
    Updated Jan 18, 2023
    Cite
    Weber, Marc (2023). Sample Dataset - HR Subject Areas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7447111
    Explore at:
    Dataset updated
    Jan 18, 2023
    Dataset authored and provided by
    Weber, Marc
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".

    Lucerne University of Applied Sciences and Arts

    Master of Science in Applied Information and Data Science (MScIDS)

    Autumn Semester 2022

    Change log Version 1.1:

    The following SQL scripts were added:

    Index  Type              Name
    1      View              pg.dictionary_table
    2      View              pg.dictionary_column
    3      View              pg.dictionary_relation
    4      View              pg.accesslayer_table
    5      View              pg.accesslayer_column
    6      View              pg.accesslayer_relation
    7      View              pg.accesslayer_fact_candidate
    8      Stored Procedure  pg.get_fact_candidate
    9      Stored Procedure  pg.get_dimension_candidate
    10     Stored Procedure  pg.get_columns

    Scripts are based on Microsoft SQL Server Version 2017 and compatible with a data warehouse built with Datavault Builder. Data warehouse objects scripts of the sample data warehouse are restricted and cannot be shared.
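
    The added views and stored procedures expose the data dictionary and access layer of the sample HR data warehouse. As a rough usage sketch on SQL Server 2017 (object names come from the change log above; the column selection and parameter-free procedure calls are assumptions, since the procedures' signatures are not documented here):

    -- Hypothetical usage sketch; not part of the dataset's own scripts.
    SELECT TOP (10) *
    FROM pg.dictionary_table;           -- browse the table dictionary

    EXEC pg.get_fact_candidate;         -- list candidate fact tables
    EXEC pg.get_dimension_candidate;    -- list candidate dimension tables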

  3. Library Carpentry SQL Lesson - DOAJ Article Sample Database

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Christopher Erdmann (2020). Library Carpentry SQL Lesson - DOAJ Article Sample Database [Dataset]. http://doi.org/10.5281/zenodo.2822005
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christopher Erdmann
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Library Carpentry SQL lesson database created from the Directory of Open Access Journals (DOAJ) data. The sample SQL database contains tables: articles, journals, languages, licences, and publishers. Previous version of the sample SQL database: Staiger, Christine (2016): LC-articles. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3409471.v3
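
    As a rough illustration of the kind of query the lesson builds against this database (a sketch only; the column names below are assumptions, not necessarily the actual schema):

    -- Hypothetical example; assumes articles and journals share an ISSNs column.
    SELECT j.Journal_Title, COUNT(*) AS article_count
    FROM articles AS a
    JOIN journals AS j ON a.ISSNs = j.ISSNs
    GROUP BY j.Journal_Title
    ORDER BY article_count DESC;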

  4. Data from: SQL Injection Attack Netflow

    • zenodo.org
    • portalcienciaytecnologia.jcyl.es
    • +1more
    Updated Sep 28, 2022
    + more versions
    Cite
    Ignacio Crespo; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. http://doi.org/10.5281/zenodo.6907252
    Explore at:
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ignacio Crespo; Adrián Campazas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.
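
    For illustration, the two attack classes extract data differently: a Union-query injection appends attacker-controlled columns to the legitimate result set, while a Blind injection infers values from true/false responses. The statements below are generic textbook sketches with hypothetical table and column names, not payloads taken from this dataset:

    -- What a vulnerable query becomes after a Union-query payload (hypothetical schema):
    SELECT name, price FROM products WHERE id = '1' UNION SELECT username, password FROM users -- ';
    -- A Blind payload instead asks yes/no questions, one condition at a time:
    SELECT name, price FROM products WHERE id = '1' AND SUBSTRING(@@version, 1, 1) = '5' -- ';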

    NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

    Datasets

    The first dataset was collected to train the detection models (D1), and the other was collected using different attacks than those used in training, in order to test the models and ensure their generalization (D2).

    The datasets contain both benign and malicious traffic. All collected datasets are balanced.

    The version of NetFlow used to build the datasets is 5.

    Dataset  Aim       Samples   Benign-malicious traffic ratio
    D1       Training  400,003   50%
    D2       Test      57,239    50%

    Infrastructure and implementation

    Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.

    DOROTHEA is configured to use NetFlow v5 and to export a flow after it is inactive for 15 seconds or after it is active for 1800 seconds (30 minutes).

    Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).

    The malicious traffic collected (SQLIA) was generated using SQLMAP. SQLMAP is a penetration testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

    The attacks were executed from 16 nodes, each launching SQLMAP with the parameters in the following table.

    Parameters                Description
    --banner, --current-user, --current-db, --hostname, --is-dba, --users, --passwords, --privileges, --roles, --dbs, --tables, --columns, --schema, --count, --dump, --comments
                              Enumerate users, password hashes, privileges, roles, databases, tables and columns
    --level=5                 Increase the probability of a false positive identification
    --risk=3                  Increase the probability of extracting data
    --random-agent            Select the User-Agent randomly
    --batch                   Never ask for user input, use the default behavior
    --answers="follow=Y"      Predefined answers ("yes") to prompts

    Every node executed SQLIA against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, connected to either a MySQL or a SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).

    The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
    The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQL Server databases.

    However, for D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic-generating nodes and 140.30.20.1/24 for victim nodes.

    For the MySQL server, MariaDB version 10.4.12 was used; the other engines were Microsoft SQL Server 2017 Express and PostgreSQL version 13.

  5. NSText2SQL

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jul 1, 2024
    Cite
    (2024). NSText2SQL [Dataset]. https://opendatalab.com/OpenDataLab/NSText2SQL
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 1, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    NSText2SQL is the dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques, including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.

  6. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table.

    2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples.

    replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file below. 2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page; the bureau of labor statistics' current population survey page; the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D

  7. duckdb-text2sql-25k

    • huggingface.co
    Cite
    MotherDuck, duckdb-text2sql-25k [Dataset]. https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    MotherDuck Corporation
    Authors
    MotherDuck
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    The duckdb-text2sql-25k dataset contains 25,000 DuckDB text-2-sql pairs covering diverse aspects of DuckDB's SQL syntax. We synthesized this dataset using Mixtral 8x7B, based on DuckDB's v0.9.2 documentation and Spider schemas that were translated to DuckDB syntax and enriched with nested type columns. Each training sample consists of a natural language prompt, a corresponding (optional) schema, and a resulting query. Each pair furthermore has a category property… See the full description on the dataset page: https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k.
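
    For illustration only (this pair is invented, not drawn from the dataset), a sample combines a natural-language prompt, an optional schema with nested types, and a DuckDB query:

    -- Prompt (hypothetical): "How many orders does each customer have?"
    -- Optional schema, using a DuckDB nested type column:
    CREATE TABLE customers (id INTEGER, name VARCHAR, orders STRUCT(order_id INTEGER, total DOUBLE)[]);
    -- Resulting query in DuckDB syntax:
    SELECT name, len(orders) AS order_count FROM customers;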

  8. Australian Employee Salary/Wages DATAbase by detailed occupation, location...

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Richard Ferrers; Australian Taxation Office (2023). Australian Employee Salary/Wages DATAbase by detailed occupation, location and year (2002-14); (plus Sole Traders) [Dataset]. http://doi.org/10.6084/m9.figshare.4522895.v5
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Australian Taxation Office
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt). See more description at the related Figshare #datavis record.

    Versions: V5: Following #datascience course, I have made main data (individual salary and wages) available as csv and Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows. Also provided Jobs, and SA4 (Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.

    #dataTotals - Salary and Wages
    Year  Workers (M)  Earnings ($B)
    2002  8.5          285
    2006  9.4          372
    2010  10.2         481
    2014  10.3         584

    #dataTotal - Sole Traders
    Year  Workers (M)  Sales ($B)  Earnings ($B)
    2002  0.9          61          13
    2006  1.0          88          19
    2010  1.1          112         26
    2014  1.1          96          30

    #links See ATO request for data at ideascale link below. See original csv open data set (CC-BY) at data.gov.au link below. This database was used to create maps of change in regional employment - see Figshare link below (m9.figshare.4056282).

    #package This file package contains a database (analysing the open data) as a SQL dump plus sample SQL text interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.

    #analysis The database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130 (offline). This is only resourced for max 1 year, from July 2016, so will expire in June 2017. Hence the filing here. The sample home page is provided here (and pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL. Nectar URL now offline - server files attached as package (html_backup[date].zip), including php scripts, html, csv, jpegs.

    #install IMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)
    1. Started MAMP on OSX.
    1.1 Go to PhpMyAdmin
    2. New Database:
    3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on MacBookPro 16Gb, 2.3 GHz i5)
    4. Four tables appeared: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rows, plus views e.g. deltahair, Industrycodes, states
    5. Run test query (see #sampleSQL below): Sum of Salary by SA4, e.g. 101 $4.7B, 102 $6.9B

    #sampleSQL
    select sa4,
      (select sum(count) from salaryWages where year = '2014' and sa4 = sw.sa4) as thisYr14,
      (select sum(count) from salaryWages where year = '2010' and sa4 = sw.sa4) as thisYr10,
      (select sum(count) from salaryWages where year = '2006' and sa4 = sw.sa4) as thisYr06,
      (select sum(count) from salaryWages where year = '2002' and sa4 = sw.sa4) as thisYr02
    from salaryWages sw
    group by sa4
    order by sa4

  9. SQL Server Monitoring Tools Market size will grow at a CAGR of 5.50% from...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Apr 4, 2025
    Cite
    Cognitive Market Research (2025). SQL Server Monitoring Tools Market size will grow at a CAGR of 5.50% from 2023 to 2030! [Dataset]. https://www.cognitivemarketresearch.com/sql-server-monitoring-tools-market-report
    Explore at:
    pdf, excel, csv, ppt (available download formats)
    Dataset updated
    Apr 4, 2025
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global SQL Server Monitoring Tools market will be USD XX million in 2023 and will expand at a compound annual growth rate (CAGR) of 5.50% from 2023 to 2030.

    North America held the largest share, with more than 40% of the global revenue, and will grow at a compound annual growth rate (CAGR) of 3.7% from 2023 to 2030.
    Europe accounted for a share of over 30% of the global revenue and is projected to expand at a CAGR of 5.50% from 2023 to 2030.
    Asia Pacific held more than 23% of the global revenue and will grow at a CAGR of 7.5% from 2023 to 2030.
    Latin America held more than 5% of the global revenue and will grow at a CAGR of 4.9% from 2023 to 2030.
    Middle East and Africa held more than 3% of the global revenue and will grow at a CAGR of 5.2% from 2023 to 2030.
    The demand for SQL Server Monitoring Tools is rising due to the increasing complexity and volume of data.
    Demand for Web remains higher in the SQL Server Monitoring Tools market.
    The consumer and retail category held the highest SQL Server Monitoring Tools market revenue share in 2023.
    

    Increasing Complexity and Volume of Data to Provide Viable Market Output

    In today's data-intensive market, enterprises must deal with massive data quantities, which strains SQL Server performance. To solve this difficulty, monitoring solutions have become essential for guaranteeing the proper operation and availability of crucial workloads. These technologies monitor database performance parameters in real-time, finding bottlenecks and optimizing queries to improve overall system efficiency. Organizations may reduce performance concerns, avoid downtime, and ensure database dependability by proactively monitoring SQL Server environments. As a result, SQL Server monitoring solutions play an important role in assisting businesses as they traverse the complexity of maintaining and extracting value from large amounts of information.

    Digital Transformation to Propel Market Growth
    

    The growing reliance on digital services and apps has increased the need for performance monitoring and uptime technologies. Maintaining consistent performance becomes critical as businesses rely more on digital platforms for operations, customer interactions, and data management. Real-time monitoring, optimization, and troubleshooting tools are critical for avoiding disruptions and downtime while providing a consistent user experience. This increased demand reflects a growing realization of the vital role that digital services play in modern operations, prompting organizations to invest in solutions that ensure the performance and availability of their digital infrastructure.

    Market Restraints of the SQL Server Monitoring Tools

    High Cost to Restrict Market Growth
    

    Monitoring tool adoption and maintenance costs can be prohibitively expensive for smaller enterprises. While these technologies are critical for guaranteeing optimal system performance, smaller companies' financial constraints may limit their use. The initial setup costs, recurring license fees, and the need for qualified personnel to manage and interpret monitoring data can all burden tight budgets. As a result, smaller firms may need to carefully consider cost-effective alternatives or alternate techniques to overcome these constraints while still providing important monitoring capabilities without jeopardizing their financial stability.

    Impact of COVID-19 on the SQL Server Monitoring Tools Market

    COVID-19 had a dual impact on the market for SQL Server Monitoring Tools. On the one hand, growing remote work highlighted the significance of robust database monitoring for dispersed systems, driving up demand. On the other hand, economic uncertainty prompted some enterprises to reconsider investments, influencing purchasing decisions. The requirement for efficient database management, particularly in remote operations, fostered market resilience. Adaptable tools to manage performance difficulties were critical, reflecting a market dynamic in which the pandemic increased the adoption of monitoring solutions while influencing decision-making based on economic restrictions.

    Introduction of SQL Server Monitoring Tools

    The SQL Serv...

  10. American Community Survey (ACS)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). American Community Survey (ACS) [Dataset]. http://doi.org/10.7910/DVN/DKI9L4
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the american community survey (acs) with r and monetdb experimental. think of the american community survey (acs) as the united states' census for off-years - the ones that don't end in zero. every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it's not a survey). the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it's often accurate at sub-state geographies and - depending how many years pooled - down to small counties. households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey. this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure. the census bureau runs and finances this behemoth, of course.

    the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files. merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file. for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files. use as few pooled years as you can, unless you like sentences that start with, "over the period of 2006 - 2010, the average american ... [insert yer findings here]."

    rather than processing the acs public use microdata sample line-by-line, the r language brazenly reads everything into memory by default. to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package principally to deal with this ram-gobbling monster. if you're already familiar with syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn't behave as you expect it to - some sqlsurvey commands require a different structure (i.e. svyby gets called through svymean) and others might not exist anytime soon (like svyolr). gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this acs code. monetdb imports, writes, recodes data slowly, but reads it hyper-fast. a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat. importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding.

    the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat.

    this new github repository contains four scripts:

    2005-2011 - download all microdata.R: create the batch (.bat) file needed to initiate the monet database in the future; download, unzip, and import each file for every year and size specified by the user; create and save household- and merged/person-level replicate weight complex sample designs; create a well-documented block of code to re-initiate the monet db server in the future. fair warning: this full script takes a loooong time. run it friday afternoon, commune with nature for the weekend, and if you've got a fast processor and speedy internet connection, monday morning it should be ready for action. otherwise, either download only the years and sizes you need or - if you gotta have 'em all - run it, minimize it, and then don't disturb it for a week.

    2011 single-year - analysis examples.R: run the well-documented block of code to re-initiate the monetdb server; load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file; perform the standard repertoire of analysis examples, only this time using sqlsurvey functions.

    2011 single-year - variable recode example.R: run the well-documented block of code to re-initiate the monetdb server; copy the single-year 2011 table to maintain the pristine original; add a new age category variable by hand; add a new age category variable systematically; re-create then save the sqlsurvey replicate weight complex sample design on this new table; close everything, then load everything back up in a fresh instance of r; replicate a few of the census statistics. no muss, no fuss.

    replicate census estimates - 2011.R: run the well-documented block of code to re-initiate the monetdb server; load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file; match every nationwide statistic on the census bureau's estimates page, using sqlsurvey functions. click here to view these four scripts.

    for more detail about the american community survey (acs), visit: the us census...

  11. Source Code Archiving to the Rescue of Reproducible Deployment — Replication...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 23, 2024
    Cite
    Sample, Timothy (2024). Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11243113
    Explore at:
    Dataset updated
    May 23, 2024
    Dataset provided by
    Sample, Timothy
    Courtès, Ludovic
    Zacchiroli, Stefano
    Tournier, Simon
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication package for the paper:

    Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622

    Generating the paper

    The paper can be generated using the following command:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make

    This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.

    It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):

    GNU Make

    SQLite 3

    GNU AWK

    Rubber

    Graphviz

    TeXLive

    Structure

    data/ contains the data examined in the paper

    scripts/ contains dedicated code for the paper

    logs/ contains logs generated during certain computations

    Preservation of Guix

    Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.

    The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.

    Analysis

    Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.

    The pog-types.sql query gives the counts of each source type (e.g. "git" or "tar-gz") for each commit covered by the database.

    The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.

    The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.

    Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
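
    As a rough illustration of the kind of aggregation these queries perform (the real queries and the actual schema ship in the package as data/schema.sql; the table and column names below are assumptions):

    -- Hypothetical sketch only; see the real pog-types.sql and data/schema.sql.
    SELECT commit_id, source_type, COUNT(*) AS n_sources
    FROM sources
    GROUP BY commit_id, source_type
    ORDER BY commit_id;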

    Estimating missing sources

    The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.

    A naïve search of Git history results in an overestimate due to Guix's branch development model. We find hashes that were never exposed to users of 'guix pull'. To work around this, we also approximate the history of commits available to 'guix pull'. We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.

    This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.

    To generate the estimate, use:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make data/missing-sources.txt

    If not using Guix, you will need additional software beyond what is used to generate the paper:

    GNU Guile

    GNU Bash

    GNU Mailutils

    GNU Parallel

    Measuring link rot

    In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that exploit Guix as a Scheme library. The scripts depend on the state of the world at the specific moment when they were run, so it is not possible to reproduce exactly the same outputs. However, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance:

    guix repl -q scripts/table-per-origin.scm

    When running these scripts for the paper, we tracked their output and saved it inside the logs directory.

  12. base_dataset

    • huggingface.co
    • opendatalab.com
    Updated Nov 8, 2023
    + more versions
    Cite
    CodeComplete (2023). base_dataset [Dataset]. https://huggingface.co/datasets/codecomplete/base_dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset authored and provided by
    CodeComplete
    Description

    This dataset is a 10% repo-sampled dataset for selected languages. We applied a repo sample rate of 10%, i.e. if the sample rate is 10% then we take 10% of all repos for a given language but include all files inside each repo. This was generated using our codecomplete/training/completions/datagen script:

    ./launch.sh \
      --dataset-name bigcode/starcoderdata \
      --subset c,cpp,go,java,javascript,typescript,python,ruby,scala,sql \
      --sample-rate 0.01 \
      --hf-token

  13. Data from: DACOS - Dataset

    • zenodo.org
    bin, txt, zip
    Updated Jan 26, 2023
    Cite
    Himesh Nandani; Mootez Saad; Tushar Sharma (2023). DACOS - Dataset [Dataset]. http://doi.org/10.5281/zenodo.7570428
    Explore at:
    txt, bin, zip (available download formats)
    Dataset updated
    Jan 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Himesh Nandani; Mootez Saad; Tushar Sharma
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DACOS - DAtaset of COde Smells

    The dataset offers annotated code snippets for three code smells: multifaceted abstraction, complex method, and long parameter list.

    In addition to a manually annotated dataset of potentially subjective snippets, we offer a larger set of snippets that are either definitely benign or definitely smelly.

    The upload contains three files:

    1. DACOSMain.sql - This is the SQL file containing the main DACOS dataset.
    2. DACOSExtended.sql - This is the SQL file containing the Extended DACOS dataset.
    3. Files.zip - The zip file containing all the source code files.

    Required Software

    The dataset is created in MySQL. Hence a local or remote installation of MySQL is needed with privileges to create and modify schemas.

    Importing the Dataset

    The dataset is a self-contained SQL file. To import the dataset, run the following command:

    mysql -u username -p database_name < DACOSMain.sql
    mysql -u username -p database_name < DACOSExtended.sql

    Understanding the Datasets

    The two datasets differ in architecture. The main dataset contains a table named annotations that holds every annotation collected from users. The sample table contains the samples presented to users for annotation. The class_metrics and method_metrics tables contain class and method metrics, respectively; these were used to filter samples that are likely to contain smells and hence can be shown to users.

    The extended dataset is created by selecting samples that fall below or above the selected metric range for each smell; hence, these samples are definitely smelly or definitely benign. The extended version of the dataset does not contain an annotations table, since its samples were not presented to users. It instead has an 'entry' table where each sample is classified according to the smell it contains. The codes for identifying smells are as follows:

    Condition                               Smell ID
    Multifaceted Abstraction present        1
    Multifaceted Abstraction not detected   4
    Long Parameter List present             2
    Long Parameter List absent              5
    Complex Method present                  3
    Complex Method absent                   6
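
    For example, once both dumps are imported, counting the definitely-smelly Long Parameter List samples in the extended dataset might look like the following (a sketch only; the name of the smell-code column in the entry table is an assumption):

    -- Hypothetical query; assumes the entry table stores the code above in a column named smell_id.
    SELECT COUNT(*) AS long_parameter_list_samples
    FROM entry
    WHERE smell_id = 2;   -- 2 = Long Parameter List present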

  14. Purchase Order Data

    • data.ca.gov
    csv, docx, pdf
    Updated Oct 23, 2019
    Cite
    California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
    Explore at:
    csv, pdf, docx (available download formats)
    Dataset updated
    Oct 23, 2019
    Dataset authored and provided by
    California Department of General Services
    Description

    The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.

    Data Limitations:
    Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system Bidsync is being deprecated, and these issues will be resolved in the future as state systems transition to Fi$cal.

    Data Collection Methodology:

    The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
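
    As a rough sketch of the final join step described above (the table and column names here are illustrative assumptions, not the actual SCPRS schema):

    -- Hypothetical illustration of joining the reference tables to the raw purchase orders.
    SELECT po.PurchaseOrderNumber, s.SupplierName, d.DepartmentName, u.CommodityTitle, po.TotalAmount
    FROM PurchaseOrder AS po
    JOIN Supplier   AS s ON s.SupplierCode   = po.SupplierCode
    JOIN Department AS d ON d.DepartmentCode = po.DepartmentCode
    JOIN UNSPSC     AS u ON u.UNSPSCCode     = po.UNSPSCCode;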

    Secondary/Related Resources:

  15. Opening the Valve on Pure Data Dataset

    • zenodo.org
    application/gzip, zip
    Updated Feb 7, 2024
    Cite
    Anisha Islam; Kalvin Eng; Abram Hindle (2024). Opening the Valve on Pure Data Dataset [Dataset]. http://doi.org/10.5281/zenodo.10576757
    Explore at:
    zip, application/gzip (available download formats)
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anisha Islam; Kalvin Eng; Abram Hindle
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page contains the i) SQLite database, and ii) scripts and instructions for the paper titled Opening the Valve on Pure-Data: Usage Patterns and Programming Practices of a Data-Flow Based Visual Programming Language.

    We have provided two main files in this link:

    1. dataset.tar.gz
    2. scripts_and_instructions.zip

    Additionally, the i) SQLite database, ii) scripts and instructions, and iii) mirrored repositories of the PD projects can also be found in the following link: https://archive.org/details/Opening_the_Valve_on_Pure_Data.

    The download instructions are as follows:

    1. Our dataset is available at this link and also at archive.org as a file titled dataset.tar.gz (~1.12GB). You can download the file and then unzip the database by running tar -xzf dataset.tar.gz.
    2. You can also find the scripts and instructions needed to use our database and replicate our work inside the scripts_and_instructions.zip (~116MB) file, which you can download from this link and also from the same archive.org link. After that, you can unzip the scripts_and_instructions.zip file by using the command: unzip scripts_and_instructions.zip.
    3. Finally, the mirrored PD repositories are available at archive.org. The file is titled pd_mirrored.tar.gz (~242.5GB). You can download the zipped folder of the mirrored repositories using the following command: wget -c https://archive.org/download/Opening_the_Valve_on_Pure_Data/pd_mirrored.tar.gz. After that, you can unzip the file using tar -xzf pd_mirrored.tar.gz.

    You can find a README.md file inside the unzipped directory titled scripts_and_instructions detailing the structure and usage of our dataset, along with some sample SQL queries and additional helper scripts for the database. Furthermore, we have provided instructions for replicating our work in the same README file.

  16. Additional file 4 of The plasma peptides of Alzheimer’s disease

    • springernature.figshare.com
    xlsx
    Updated Jun 10, 2023
    Cite
    Angelique Florentinus-Mefailoski; Peter Bowden; Philip Scheltens; Joep Killestein; Charlotte Teunissen; John G. Marshall (2023). Additional file 4 of The plasma peptides of Alzheimer’s disease [Dataset]. http://doi.org/10.6084/m9.figshare.14870521.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    figshare
    Authors
    Angelique Florentinus-Mefailoski; Peter Bowden; Philip Scheltens; Joep Killestein; Charlotte Teunissen; John G. Marshall
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4: Table S4. The STRING Function analysis with Chi Square (χ2) values greater than nine (9) at 1 degree of freedom.

  17. Additional file 2 of The plasma peptides of sepsis

    • springernature.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Thanusi Thavarajah; Claudia C. dos Santos; Arthur S. Slutsky; John C. Marshall; Pete Bowden; Alexander Romaschin; John G. Marshall (2023). Additional file 2 of The plasma peptides of sepsis [Dataset]. http://doi.org/10.6084/m9.figshare.12606817.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Thanusi Thavarajah; Claudia C. dos Santos; Arthur S. Slutsky; John C. Marshall; Pete Bowden; Alexander Romaschin; John G. Marshall
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2: Table S2. The tryptic phosphopeptides with Chi square FDR q-values.

  18. Illinois Coastal Zone Water Quality Database (ICoastalDB)

    • databank.illinois.edu
    Updated Apr 1, 2025
    Cite
    Elias Getahun; Atticus Zavelle; Laura Keefer (2025). Illinois Coastal Zone Water Quality Database (ICoastalDB) [Dataset]. http://doi.org/10.13012/B2IDB-7799136_V2
    Explore at:
    Dataset updated
    Apr 1, 2025
    Authors
    Elias Getahun; Atticus Zavelle; Laura Keefer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Illinois
    Dataset funded by
    Illinois Department of Natural Resources (IDNR) - Illinois Coastal Management Program (ICMP)
    Description

    ICoastalDB, which was developed using Microsoft structured query language (SQL) Server, consists of water quality and related data in the Illinois coastal zone that were collected by various organizations. The information in the dataset includes, but is not limited to, sample data type, method of data sampling, location, time and date of sampling and data units.

  19. Dataset for Stack overflow Manual Results about challenges in developing...

    • zenodo.org
    zip
    Updated Aug 10, 2022
    + more versions
    Cite
    Zehao Wang; Tse-Hsun (Peter) Chen; Haoxiang Zhang; Shaowei Wang (2022). Dataset for Stack overflow Manual Results about challenges in developing Spark applications [Dataset]. http://doi.org/10.5281/zenodo.6977517
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zehao Wang; Tse-Hsun (Peter) Chen; Haoxiang Zhang; Shaowei Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Stack Overflow manual study results for the paper "An Empirical Study on the Challenges that Developers Encounter When Developing Apache Spark Applications".

    • the data folder contains the Stackoverflow Manual Results.csv file that is the manual analysis result for the Stack Overflow posts. The CSV file contains information on the classification of the data, the reasons and the number of views, etc.

    • the scripts folder contains the python and SQL files that are used for data collection and data analysis.
      • query_data.sql is used to collect data from the Stack Exchange website.
      • sample.py is used to sample data for the manual analysis in the paper.
      • common_issue.py is used to study the percentage of common issues in rq1.
      • popularity.py is used to calculate the average of normalized view counts in rq2.
      • popularity_difficulty.py is used to calculate the average of raw view counts and the median hours to receive an answer in rq2.
      • root_cuase.py is used to study the percentage of root causes in rq3.

  20. SynSQL-2.5M

    • huggingface.co
    Updated May 31, 2025
    Cite
    Indraneil Paul (2025). SynSQL-2.5M [Dataset]. https://huggingface.co/datasets/iNeil77/SynSQL-2.5M
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    Indraneil Paul
    Description

    A filtered version of the dataset at seeklhy/SynSQL-2.5M with the data transformed to Apache Arrow and filtered as follows:

    Only samples that could be joined with a valid or non-empty database schema have been retained. Only samples whose joined schema was parsable with a tree-sitter-sql parser without errors have been retained.
