43 datasets found
  1. WikiSQL Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Victor Zhong; Caiming Xiong; Richard Socher, WikiSQL Dataset [Dataset]. https://paperswithcode.com/dataset/wikisql
    Explore at:
    Authors
    Victor Zhong; Caiming Xiong; Richard Socher
    Description

    WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples) and test sets (17,284 examples). It can be used for natural language inference tasks related to relational databases.
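
    To illustrate the task format, here is a hypothetical example of such a pair (invented for illustration, not an actual record from the corpus). WikiSQL queries are restricted to single-table SELECT statements with simple aggregations and WHERE conditions over one Wikipedia table:

    -- Question: "Which players scored more than 20 points in the 2004 season?"
    -- Annotated SQL (table and column names are invented for illustration):
    SELECT player
    FROM season_stats
    WHERE points > 20 AND season = 2004;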

  2. Northwind and Chinook DataBase

    • kaggle.com
    Updated Jun 19, 2024
    Cite
    RCURIOSO (2024). Northwind and Chinook DataBase [Dataset]. https://www.kaggle.com/datasets/rcurioso/northwind-and-chinook-database
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    RCURIOSO
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Northwind Database

    The Northwind database is a sample database originally created by Microsoft and used as the basis for its tutorials across a variety of database products for decades. It contains sales data for a fictitious company called "Northwind Traders", which imports and exports specialty foods from around the world. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting. The Northwind database has since been ported to a variety of non-Microsoft databases, including PostgreSQL.

    The Northwind dataset includes sample data for the following (see the example query after the diagram):

    • Suppliers: Northwind's suppliers and vendors
    • Customers: Customers who buy products from Northwind
    • Employees: Details of Northwind Traders' employees
    • Products: Product information
    • Shippers: Details of the carriers that ship products from the merchants to the end customers
    • Orders and Order Details: Sales order transactions taking place between the customers and the company

    [Image: Northwind E-R diagram]
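
    As a quick orientation to this schema, here is a minimal sketch that joins orders to their customers and line items to compute revenue per customer. Identifiers follow the common PostgreSQL port of Northwind (customers, orders, order_details); other ports capitalize or quote them differently, so treat the names as assumptions.

    -- Revenue per customer; identifiers as in the common PostgreSQL port.
    SELECT c.company_name,
           SUM(od.unit_price * od.quantity * (1 - od.discount)) AS revenue
    FROM customers c
    JOIN orders o         ON o.customer_id = c.customer_id
    JOIN order_details od ON od.order_id   = o.order_id
    GROUP BY c.company_name
    ORDER BY revenue DESC;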

    Chinook DataBase

    Chinook is a sample database available for SQL Server, Oracle, MySQL, etc. It can be created by running a single SQL script. The Chinook database is an alternative to the Northwind database, and is ideal for demos and for testing ORM tools targeting single or multiple database servers.

    The Chinook data model represents a digital media store, including tables for artists, albums, media tracks, invoices, and customers.

    The media-related data was created using real data from an iTunes library. Customer and employee information was created by hand using fictitious names, addresses that can be located on Google Maps, and other well-formatted data (phone, fax, email, etc.). Sales information is auto-generated using random data over a four-year period.

    Why the name Chinook? The name of this sample database is a nod to the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian prairies and Great Plains meet various mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a good name choice for a database intended as an alternative to Northwind.

    [Image: Chinook database diagram]
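
    Similarly, a minimal sketch against the Chinook schema, assuming the standard identifiers shipped with the official Chinook script (Artist, Album, Track): track counts per artist.

    -- Track count per artist (standard Chinook identifiers assumed).
    SELECT ar.Name AS artist,
           COUNT(t.TrackId) AS track_count
    FROM Artist ar
    JOIN Album al ON al.ArtistId = ar.ArtistId
    JOIN Track t  ON t.AlbumId   = al.AlbumId
    GROUP BY ar.Name
    ORDER BY track_count DESC;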

  3. Purchase Order Data

    • data.ca.gov
    • catalog.data.gov
    csv, docx, pdf
    Updated Oct 23, 2019
    Cite
    California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
    Explore at:
    docx, csv, pdf (available download formats)
    Dataset updated
    Oct 23, 2019
    Dataset authored and provided by
    California Department of General Services
    Description

    The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.

    Data Limitations:
    Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved as state systems transition to Fi$cal.

    Data Collection Methodology:

    The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
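
    The final join described above might look roughly like the sketch below; since the actual schema is not published with the dataset, every table and column name here is a hypothetical placeholder.

    -- Hypothetical sketch of the final join across the four tables.
    SELECT po.purchase_order_number,
           d.department_name,
           s.supplier_name,
           u.commodity_title,
           po.total_price
    FROM PurchaseOrder po
    JOIN Department d ON d.department_id = po.department_id
    JOIN Supplier   s ON s.supplier_id   = po.supplier_id
    JOIN UNSPSC     u ON u.unspsc_code   = po.unspsc_code;  -- only the first UNSPSC number per PO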

    Secondary/Related Resources:

  4. (Sunset) 📒 Meta Kaggle ported to MS SQL SERVER

    • kaggle.com
    Updated Mar 20, 2024
    Cite
    BwandoWando (2024). (Sunset) 📒 Meta Kaggle ported to MS SQL SERVER [Dataset]. http://doi.org/10.34740/kaggle/dsv/7896543
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. Also, I tend to write queries faster when using SQL MANAGEMENT STUDIO, like 100x faster. So, I ported Kaggle's Meta Kaggle dataset into MS SQL SERVER 2022 database format, created a backup file, then uploaded it here.

    • MSSQL VERSION: SQL Server 2022
    • Collation: SQL_Latin1_General_CP1_CI_AS
    • Recovery model: simple

    Requirements

    • Download and install the SQL SERVER 2022 Developer edition here
    • Download the backup file
    • Restore the backup file into your local instance. If you haven't done this before, it's easy and straightforward. Here is a guide, and a minimal T-SQL sketch follows this list.
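
    For reference, restoring the backup from T-SQL looks roughly like this. The file name, paths, and logical file names below are assumptions; check them with RESTORE FILELISTONLY, or use the SSMS restore wizard from the linked guide instead.

    -- Minimal restore sketch; paths and logical names are assumptions.
    RESTORE DATABASE MetaKaggle
    FROM DISK = N'C:\backups\MetaKaggle.bak'
    WITH MOVE N'MetaKaggle'     TO N'C:\data\MetaKaggle.mdf',
         MOVE N'MetaKaggle_log' TO N'C:\data\MetaKaggle_log.ldf',
         RECOVERY;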

    (QUOTED FROM THE ORIGINAL DATASET)

    Meta Kaggle

    Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.


    Notes

  5. Coursera - Data Science Fundamentals with Python and SQL

    • academictorrents.com
    bittorrent
    Updated May 2, 2021
    Cite
    None (2021). Coursera - Data Science Fundamentals with Python and SQL [Dataset]. https://academictorrents.com/details/bdc0bb1499b1992a5488b4bbcfc9288c30793c08
    Explore at:
    bittorrent (807986536 bytes); available download formats
    Dataset updated
    May 2, 2021
    Authors
    None
    License

    https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title 'Coursera - Data Science Fundamentals with Python and SQL'

  6. wikisql

    • huggingface.co
    Updated Aug 28, 2023
    + more versions
    Cite
    Salesforce (2023). wikisql [Dataset]. https://huggingface.co/datasets/Salesforce/wikisql
    Explore at:
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    https://choosealicense.com/licenses/unknown/

    Description

    A large crowd-sourced dataset for developing natural language interfaces for relational databases

  7. O*NET Database

    • onetcenter.org
    excel, mysql, oracle +2
    Updated May 20, 2025
    Cite
    National Center for O*NET Development (2025). O*NET Database [Dataset]. https://www.onetcenter.org/database.html
    Explore at:
    oracle, sql server, text, mysql, excel (available download formats)
    Dataset updated
    May 20, 2025
    Dataset provided by
    Occupational Information Network
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Dataset funded by
    US Department of Labor, Employment and Training Administration
    Description

    The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.

    Data content areas include:

    • Worker Characteristics (e.g., Abilities, Interests, Work Styles)
    • Worker Requirements (e.g., Education, Knowledge, Skills)
    • Experience Requirements (e.g., On-the-Job Training, Work Experience)
    • Occupational Requirements (e.g., Detailed Work Activities, Work Context)
    • Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)
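
    As a taste of the relational distribution, the sketch below lists task statements for matching occupations. The table and column names are taken from recent O*NET database releases (occupation_data, task_statements) but should be verified against the data dictionary for your version.

    -- Task statements for matching occupations (table names assumed from recent releases).
    SELECT o.title, t.task
    FROM occupation_data o
    JOIN task_statements t ON t.onetsoc_code = o.onetsoc_code
    WHERE o.title LIKE '%Database%';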

  8. SQL Bike Stores

    • kaggle.com
    Updated Nov 21, 2024
    Cite
    Mohamed ZRIRAK (2024). SQL Bike Stores [Dataset]. https://www.kaggle.com/datasets/mohamedzrirak/sql-bkestores
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohamed ZRIRAK
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Download: SQL query. This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.

    Key Objectives:

    • Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
    • Product and Category Insights: Evaluate product performance and each category's impact on overall sales.
    • Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
    • Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

    Techniques Used (see the sketch after the Use Cases list):

    • SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
    • Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
    • Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

    Use Cases:

    • Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
    • Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
    • Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
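
    A condensed sketch of the kind of query the project describes, using the table names listed above. The column names (quantity, list_price, discount, and the name fields) follow the widely used BikeStores sample schema and are assumptions; the string concatenation uses T-SQL's + operator.

    -- Units and revenue per store and sales representative (column names assumed).
    SELECT st.store_name,
           sf.first_name + ' ' + sf.last_name AS sales_rep,
           SUM(oi.quantity) AS total_units,
           SUM(oi.quantity * oi.list_price * (1 - oi.discount)) AS total_revenue
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN stores st      ON st.store_id = o.store_id
    JOIN staffs sf      ON sf.staff_id = o.staff_id
    GROUP BY st.store_name, sf.first_name, sf.last_name
    ORDER BY total_revenue DESC;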

  9. London Development Database SQL Extract

    • cloud.csiss.gmu.edu
    • data.europa.eu
    • +1 more
    bin, pdf
    Updated May 25, 2018
    + more versions
    Cite
    Greater London Authority (GLA) (2018). London Development Database SQL Extract [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/london-development-database-sql-extract
    Explore at:
    pdf, bin (available download formats)
    Dataset updated
    May 25, 2018
    Dataset provided by
    Greater London Authority (GLA)
    License

    http://reference.data.gov.uk/id/open-government-licence

    Area covered
    London
    Description

    This is a copy of the London Development Database.

    This is the entire LDD database exported as a .sql.tar using pg_dump. For information on how to use this file and details of the database tables, please refer to the document 'London Development database export.pdf'.

    The permissions data within this extract includes anything submitted to LDD by 23/05/2018. All data is provided by London’s planning authorities.

    An extract from the database can be downloaded from the London Datastore and data can be viewed on a map at https://maps.london.gov.uk/map/?ldd

  10. sql-create-context

    • huggingface.co
    • opendatalab.com
    Updated Apr 21, 2023
    + more versions
    Cite
    brianm (2023). sql-create-context [Dataset]. https://huggingface.co/datasets/b-mc2/sql-create-context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2023
    Authors
    brianm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and a SQL query answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent hallucination of column and table names often seen when training on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from different DBMSs and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
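
    A hypothetical record in this format (invented for illustration, not an actual row) pairs a CREATE TABLE context with a question and the answering query:

    -- context:
    CREATE TABLE head (age INTEGER, name VARCHAR);
    -- question: "How many heads of departments are older than 56?"
    -- answer:
    SELECT COUNT(*) FROM head WHERE age > 56;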

  11. A user-friendly extract of the LibGen scimag metadata SQL dump on 2017-04-07...

    • figshare.com
    application/x-rar
    Updated May 30, 2023
    Cite
    Daniel Himmelstein; Stephen McLaughlin (2023). A user-friendly extract of the LibGen scimag metadata SQL dump on 2017-04-07 [Dataset]. http://doi.org/10.6084/m9.figshare.5231245.v1
    Explore at:
    application/x-rar (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Daniel Himmelstein; Stephen McLaughlin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata for the LibGen scimag database of full-text scholarly documents. Each row of this dataset corresponds to a scholarly document in the LibGen scimag database, as identified by its DOI.

    scimag_dbbackup-2017-04-07.rar was downloaded from http://libgen.io/dbdumps/backup_archive/scimag_dbbackup-2017-04-07.rar. It is a compressed SQL dump of the LibGen scimag metadata database on 2017-04-07, unmodified as downloaded from libgen.io. It encodes a single table named scimag.

    libgen-scimag-2017-04-07.tsv.xz contains a TSV version of the scimag table from scimag_dbbackup-2017-04-07.rar. It is more user-friendly because it provides access to the data without requiring MySQL, is UTF-8 encoded, and has null bytes removed.

    The code that downloaded and processed these datasets is at https://git.io/v7Uh4. Users should note that the TimeAdded column appears to store the modification rather than the creation date for each DOI. As discussed in https://doi.org/b9s5, this field should not be mistaken for the date of first upload to LibGen scimag.

  12. Repackaged Full ITIS Data Set (MS SQL Server)

    • zenodo.org
    application/gzip, zip
    Updated Oct 20, 2021
    Cite
    Integrated Taxonomic Information System; Integrated Taxonomic Information System (2021). Repackaged Full ITIS Data Set (MS SQL Server) [Dataset]. http://doi.org/10.5281/zenodo.3833105
    Explore at:
    application/gzip, zip (available download formats)
    Dataset updated
    Oct 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Integrated Taxonomic Information System; Integrated Taxonomic Information System
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Retrieved 18 May 2020 from the Integrated Taxonomic Information System (ITIS) (http://www.itis.gov), via https://www.itis.gov/downloads/itisMSSql.zip.

    The archive itisMSSql.zip was unzipped, and repackaged as individual gzipped files. The original zip file is included in this data publication.

    Files in this publication:

    1. itisMSSql.zip - file downloaded from https://www.itis.gov/downloads/itisMSSql.zip on 18 May 2020

    2. Files ending with .gz (e.g., taxonomic_units.gz, synonym_links.gz): repackaged, gzipped content of itisMSSql.zip

  13. BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)...

    • paperswithcode.com
    Updated Jan 5, 2024
    + more versions
    Cite
    Jinyang Li; Binyuan Hui; Ge Qu; Jiaxi Yang; Binhua Li; Bowen Li; Bailin Wang; Bowen Qin; Rongyu Cao; Ruiying Geng; Nan Huo; Xuanhe Zhou; Chenhao Ma; Guoliang Li; Kevin C. C. Chang; Fei Huang; Reynold Cheng; Yongbin Li (2024). BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) Dataset [Dataset]. https://paperswithcode.com/dataset/bird-sql
    Explore at:
    Dataset updated
    Jan 5, 2024
    Authors
    Jinyang Li; Binyuan Hui; Ge Qu; Jiaxi Yang; Binhua Li; Bowen Li; Bailin Wang; Bowen Qin; Rongyu Cao; Ruiying Geng; Nan Huo; Xuanhe Zhou; Chenhao Ma; Guoliang Li; Kevin C. C. Chang; Fei Huang; Reynold Cheng; Yongbin Li
    Description

    BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.

  14. [udemy] SQL Masterclass for Financial Analysis & Financial Reporting

    • academictorrents.com
    bittorrent
    Updated Dec 26, 2024
    Cite
    Irfan Sharif (2024). [udemy] SQL Masterclass for Financial Analysis & Financial Reporting [Dataset]. https://academictorrents.com/details/161282939abe2462e37cd8a59664043716a1a529
    Explore at:
    bittorrent (735121934 bytes); available download formats
    Dataset updated
    Dec 26, 2024
    Dataset authored and provided by
    Irfan Sharif
    License

    https://academictorrents.com/nolicensespecified

    Description

    Official Course URL: udemy.com/course/sql-for-financial-data-analysis/

    Course Overview: Unlock the power of SQL for financial data analysis and reporting. This course is tailored for non-tech professionals who want to streamline their analytics and reporting capabilities. Learn to extract and process financial data, prepare detailed reports like Profit & Loss Statements and Balance Sheets, and calculate critical financial ratios through practical exercises.

    What You'll Learn:
    - SQL Basics: Master database querying techniques for financial data.
    - Report Preparation: Create Profit & Loss Statements, Balance Sheets, and Cash Flow Statements.
    - Key Analytics: Calculate and interpret profitability, efficiency, and liquidity ratios.
    - Database Skills: Gain hands-on experience without prior technical expertise.

    Course Benefits:
    - Practical Applications: Apply SQL to real-world financial scenarios.
    - Independent Reporting: Reduce reliance on system-generated reports.
    - Career Advancem
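
    For flavor, here is a toy version of the kind of report the course covers: rolling a hypothetical general_ledger table up into a one-line profit and loss summary. Every identifier here is invented for illustration and is not from the course materials.

    -- Toy P&L rollup over a hypothetical general_ledger table.
    SELECT SUM(CASE WHEN account_type = 'Revenue' THEN amount ELSE 0 END) AS total_revenue,
           SUM(CASE WHEN account_type = 'Expense' THEN amount ELSE 0 END) AS total_expenses,
           SUM(CASE WHEN account_type = 'Revenue' THEN amount ELSE -amount END) AS net_income
    FROM general_ledger
    WHERE fiscal_year = 2024;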

  15. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    • 2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table
    • 2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples
    • replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, and the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D

  16. Playlist2vec: Spotify Million Playlist Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 22, 2021
    Cite
    Piyush Papreja; Piyush Papreja (2021). Playlist2vec: Spotify Million Playlist Dataset [Dataset]. http://doi.org/10.5281/zenodo.5002584
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Piyush Papreja; Piyush Papreja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was created using the Spotify developer API. It consists of user-created as well as Spotify-curated playlists.
    The dataset consists of 1 million playlists, 3 million unique tracks, 3 million unique albums, and 1.3 million artists.
    The data is stored in a SQL database, with the primary entities being songs, albums, artists, and playlists.
    Each of the aforementioned entities is represented by a unique ID (Spotify URI).
    Data is stored in the following tables:

    • album
    • artist
    • track
    • playlist
    • track_artist1
    • track_playlist1

    album

    | id | name | uri |

    id: Album ID as provided by Spotify
    name: Album Name as provided by Spotify
    uri: Album URI as provided by Spotify


    artist

    | id | name | uri |

    id: Artist ID as provided by Spotify
    name: Artist Name as provided by Spotify
    uri: Artist URI as provided by Spotify


    track

    | id | name | duration | popularity | explicit | preview_url | uri | album_id |

    id: Track ID as provided by Spotify
    name: Track Name as provided by Spotify
    duration: Track Duration (in milliseconds) as provided by Spotify
    popularity: Track Popularity as provided by Spotify
    explicit: Whether the track has explicit lyrics or not. (true or false)
    preview_url: A link to a 30 second preview (MP3 format) of the track. Can be null
    uri: Track Uri as provided by Spotify
    album_id: Album Id to which the track belongs


    playlist

    | id | name | followers | uri | total_tracks |

    id: Playlist ID as provided by Spotify
    name: Playlist Name as provided by Spotify
    followers: Playlist Followers as provided by Spotify
    uri: Playlist Uri as provided by Spotify
    total_tracks: Total number of tracks in the playlist.

    track_artist1

    | track_id | artist_id |

    Track-Artist association table

    track_playlist1

    | track_id | playlist_id |

    Track-Playlist association table
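
    Given the documented tables, a typical query joins an association table back to its entities; for example, the ten most-followed playlists with their track counts. The identifiers are exactly those documented above; the LIMIT syntax assumes a MySQL- or PostgreSQL-style engine, which the dump does not specify.

    -- Ten most-followed playlists and their track counts.
    SELECT p.name, p.followers, COUNT(tp.track_id) AS n_tracks
    FROM playlist p
    JOIN track_playlist1 tp ON tp.playlist_id = p.id
    GROUP BY p.id, p.name, p.followers
    ORDER BY p.followers DESC
    LIMIT 10;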

    - - - - - SETUP - - - - -


    The data is in the form of a SQL dump. The download size is about 10 GB, and the database populated from it comes out to about 35GB.

    spotifydbdumpschemashare.sql contains the schema for the database (for reference).
    spotifydbdumpshare.sql is the actual data dump.


    Setup steps:
    1. Create database

    - - - - - PAPER - - - - -


    The description of this dataset can be found in the following paper:

    Papreja P., Venkateswara H., Panchanathan S. (2020) Representation, Exploration and Recommendation of Playlists. In: Cellier P., Driessens K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Communications in Computer and Information Science, vol 1168. Springer, Cham

  17. Geodatabase for the Baltimore Ecosystem Study Spatial Data

    • portal.edirepository.org
    • search.dataone.org
    application/vnd.rar
    Updated May 4, 2012
    Cite
    Jarlath O'Neal-Dunne; Morgan Grove (2012). Geodatabase for the Baltimore Ecosystem Study Spatial Data [Dataset]. http://doi.org/10.6073/pasta/377da686246f06554f7e517de596cd2b
    Explore at:
    application/vnd.rar (29574980 kilobytes); available download formats
    Dataset updated
    May 4, 2012
    Dataset provided by
    EDI
    Authors
    Jarlath O'Neal-Dunne; Morgan Grove
    Time period covered
    Jan 1, 1999 - Jun 1, 2014
    Area covered
    Description

    The establishment of a BES Multi-User Geodatabase (BES-MUG) allows for the storage, management, and distribution of geospatial data associated with the Baltimore Ecosystem Study. At present, BES data is distributed over the internet via the BES website. While having geospatial data available for download is a vast improvement over having the data housed at individual research institutions, it still suffers from some limitations. BES-MUG overcomes these limitations, improving the quality of the geospatial data available to BES researchers, thereby leading to more informed decision-making.

       BES-MUG builds on Environmental Systems Research Institute's (ESRI) ArcGIS and ArcSDE technology. ESRI was selected because its geospatial software offers robust capabilities. ArcGIS is implemented agency-wide within the USDA and is the predominant geospatial software package used by collaborating institutions.
    
    
    Commercially available enterprise database packages (DB2, Oracle, SQL Server) provide an efficient means to store, manage, and share large datasets. However, standard database capabilities are limited with respect to geographic datasets because they lack the ability to deal with complex spatial relationships. By using ESRI's ArcSDE (Spatial Database Engine) in conjunction with database software, geospatial data can be handled much more effectively through the implementation of the Geodatabase model. Through ArcSDE and the Geodatabase model the database's capabilities are expanded, allowing for multiuser editing, intelligent feature types, and the establishment of rules and relationships. ArcSDE also allows users to connect to the database using ArcGIS software without being burdened by the intricacies of the database itself.
    
    
    For an example of how BES-MUG will help improve the quality and timeliness of BES geospatial data, consider a census block group layer that is in need of updating. Rather than the researcher downloading the dataset, editing it, and resubmitting it through ORS, access rules will allow the authorized user to edit the dataset over the network. Established rules will ensure that attribute and topological integrity is maintained, so that key fields are not left blank and that the block group boundaries stay within tract boundaries. Metadata will automatically be updated, showing who edited the dataset and when, in the event any questions arise.
    
    
    Currently, a functioning prototype Multi-User Database has been developed for BES at the University of Vermont Spatial Analysis Lab, using ArcSDE and IBM's DB2 Enterprise Database as a back-end architecture. This database, which is currently only accessible to those on the UVM campus network, will shortly be migrated to a Linux server where it will be accessible for database connections over the Internet. Passwords can then be handed out to all interested researchers on the project, who will be able to make a database connection through the Geographic Information Systems software interface on their desktop computer.
    
    
       This database will include a very large number of thematic layers. Those layers are currently divided into biophysical, socio-economic and imagery categories. Biophysical includes data on topography, soils, forest cover, habitat areas, hydrology and toxics. Socio-economics includes political and administrative boundaries, transportation and infrastructure networks, property data, census data, household survey data, parks, protected areas, land use/land cover, zoning, public health and historic land use change. Imagery includes a variety of aerial and satellite imagery.
    
    
       See the readme: http://96.56.36.108/geodatabase_SAL/readme.txt
    
    
       See the file listing: http://96.56.36.108/geodatabase_SAL/diroutput.txt
    
  18. spider

    • huggingface.co
    • opendatalab.com
    Updated Dec 9, 2021
    + more versions
    Cite
    XLang NLP Lab (2021). spider [Dataset]. https://huggingface.co/datasets/xlangai/spider
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset authored and provided by
    XLang NLP Lab
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Spider

      Dataset Summary
    

    Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
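
    Unlike single-table benchmarks such as WikiSQL, Spider questions routinely require joins, grouping, and nesting over multi-table schemas. A hypothetical Spider-style pair (invented for illustration, not taken from the dataset):

    -- Question: "For each country, how many singers are over the age of 40?"
    SELECT country, COUNT(*) AS n_singers
    FROM singer
    WHERE age > 40
    GROUP BY country;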

      Supported Tasks and Leaderboards
    

    The leaderboard can be seen at https://yale-lily.github.io/spider

      Languages
    

    The text in the dataset is in English.

      Dataset Structure

    Data… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/spider.
    
  19. Health and Retirement Study (HRS)

    • search.dataone.org
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    • 1992 - 2010 download HRS microdata.R: loop through every year and every file, download, then unzip everything in one big party
    • import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk; load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
    • longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create two database-backed complex sample survey objects, using a taylor-series linearization design; perform a mountain of analysis examples with wave weights from two different points in the panel
    • import example HRS file.R: load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html); parse through the IF block at the bottom of the sas importation script, blank out a number of variables; save the file as an R data file (.rda) for fast loading later
    • replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create a database-backed complex sample survey object, using a taylor-series linearization design; exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs.

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  20. ASSEMBLY OF FRANCE METROPOLITAN OPEN STREET MAP: GEOPACKAGE AND SQL FORMAT

    • data.europa.eu
    plain text, zip
    Updated Aug 22, 2023
    Cite
    DELETED DELETED (2023). ASSEMBLY OF FRANCE METROPOLITAN OPEN STREET MAP: GEOPACKAGE AND SQL FORMAT [Dataset]. https://data.europa.eu/data/datasets/60c46d63ec3bdcb9d526c776?locale=en
    Explore at:
    zip (1197185836), zip, zip (300551415), plain text (433); available download formats
    Dataset updated
    Aug 22, 2023
    Dataset authored and provided by
    DELETED DELETED
    Area covered
    Metropolitan France, France
    Description

    Here you will find an assembly of OpenStreetMap data for metropolitan France. The GeoPackage version also contains data from neighbouring countries (border regions, except Spain). The .qgz project allows the GeoPackage data to be opened with the style already loaded, switching according to the zoom level. A video presenting this GPKG data and QGIS: https://www.youtube.com/watch?v=R6O9cMqVVvM&t=6s. The .sql version is characterised by an additional attribute on each geometric entity: the INSEE code. This data will be updated on a monthly basis.
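
    Once the .sql version is loaded into a database, that INSEE code attribute can be used to filter features by commune. A sketch, with a hypothetical layer name since the actual table layout is not described here:

    -- Filter features by INSEE commune code (table and column names hypothetical).
    SELECT *
    FROM osm_buildings
    WHERE insee_code = '75056';  -- 75056 is the INSEE code for Paris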

    INSTRUCTIONS FOR UNPACKING THE GPKG DATA: Download all files and rename them as follows:

    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.001
    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_002.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.002
    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_003.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.003
    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_004.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.004

    or, if you know batch scripting, create a .bat file containing the following (or rename the provided rename.txt file to rename.bat):

    pushd "%~dp0"
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.001
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_002.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.002
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_003.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.003
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_004.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.004

    then launch the .bat file by double-clicking on it (the batch file must be in the same folder as the zip files)

    Then right-click on the OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001.zip file and extract it to "OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001\" with your decompression software. There is no need to click on 002, 003, 004: opening file .001 opens all the other parts of the archive.

    For the .sql version, the procedure is the same. Rename:
    OSM_SQL_FXX_PRDG_D000_ED214_001.zip -> OSM_SQL_FXX_PRDG_D000_ED214.zip.001
    OSM_SQL_FXX_PRDG_D000_ED214_002.zip -> OSM_SQL_FXX_PRDG_D000_ED214.zip.002
    OSM_SQL_FXX_PRDG_D000_ED214_003.zip -> OSM_SQL_FXX_PRDG_D000_ED214.zip.003
    Then carry out the decompression.
