43 datasets found
  1. WikiSQL Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Victor Zhong; Caiming Xiong; Richard Socher, WikiSQL Dataset [Dataset]. https://paperswithcode.com/dataset/wikisql
    Explore at:
    Authors
    Victor Zhong; Caiming Xiong; Richard Socher
    Description

    WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples) and test sets (17,284 examples). It can be used for natural language inference tasks related to relational databases.
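
    To illustrate the task format, here is a hypothetical example of such a pair (invented for illustration, not an actual record from the corpus). WikiSQL queries are restricted to single-table SELECT statements with simple aggregations and WHERE conditions over one Wikipedia table:

    -- Question: "Which players scored more than 20 points in the 2004 season?"
    -- Annotated SQL (table and column names are invented for illustration):
    SELECT player
    FROM season_stats
    WHERE points > 20 AND season = 2004;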

  2. Northwind and Chinook DataBase

    • kaggle.com
    Updated Jun 19, 2024
    Cite
    RCURIOSO (2024). Northwind and Chinook DataBase [Dataset]. https://www.kaggle.com/datasets/rcurioso/northwind-and-chinook-database
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    RCURIOSO
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Northwind Database

    The Northwind database is a sample database originally created by Microsoft and used as the basis for its tutorials across a variety of database products for decades. It contains sales data for a fictitious company called "Northwind Traders", which imports and exports specialty foods from around the world. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting. The Northwind database has since been ported to a variety of non-Microsoft databases, including PostgreSQL.

    The Northwind dataset includes sample data for the following (see the example query after the diagram):

    • Suppliers: Northwind's suppliers and vendors
    • Customers: Customers who buy products from Northwind
    • Employees: Details of Northwind Traders' employees
    • Products: Product information
    • Shippers: Details of the carriers that ship products from the merchants to the end customers
    • Orders and Order Details: Sales order transactions taking place between the customers and the company

    [Image: Northwind E-R diagram]
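
    As a quick orientation to this schema, here is a minimal sketch that joins orders to their customers and line items to compute revenue per customer. Identifiers follow the common PostgreSQL port of Northwind (customers, orders, order_details); other ports capitalize or quote them differently, so treat the names as assumptions.

    -- Revenue per customer; identifiers as in the common PostgreSQL port.
    SELECT c.company_name,
           SUM(od.unit_price * od.quantity * (1 - od.discount)) AS revenue
    FROM customers c
    JOIN orders o         ON o.customer_id = c.customer_id
    JOIN order_details od ON od.order_id   = o.order_id
    GROUP BY c.company_name
    ORDER BY revenue DESC;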

    Chinook DataBase

    Chinook is a sample database available for SQL Server, Oracle, MySQL, etc. It can be created by running a single SQL script. The Chinook database is an alternative to the Northwind database, and is ideal for demos and for testing ORM tools targeting single or multiple database servers.

    The Chinook data model represents a digital media store, including tables for artists, albums, media tracks, invoices, and customers.

    The media-related data was created using real data from an iTunes library. Customer and employee information was created by hand using fictitious names, addresses that can be located on Google Maps, and other well-formatted data (phone, fax, email, etc.). Sales information is auto-generated using random data over a four-year period.

    Why the name Chinook? The name of this sample database is a nod to the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian prairies and Great Plains meet various mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a good name choice for a database intended as an alternative to Northwind.

    [Image: Chinook database diagram]
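
    Similarly, a minimal sketch against the Chinook schema, assuming the standard identifiers shipped with the official Chinook script (Artist, Album, Track): track counts per artist.

    -- Track count per artist (standard Chinook identifiers assumed).
    SELECT ar.Name AS artist,
           COUNT(t.TrackId) AS track_count
    FROM Artist ar
    JOIN Album al ON al.ArtistId = ar.ArtistId
    JOIN Track t  ON t.AlbumId   = al.AlbumId
    GROUP BY ar.Name
    ORDER BY track_count DESC;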

  3. Purchase Order Data

    • data.ca.gov
    • catalog.data.gov
    csv, docx, pdf
    Updated Oct 23, 2019
    Cite
    California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
    Explore at:
    docx, csv, pdf (available download formats)
    Dataset updated
    Oct 23, 2019
    Dataset authored and provided by
    California Department of General Services
    Description

    The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.

    Data Limitations:
    Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved as state systems transition to Fi$cal.

    Data Collection Methodology:

    The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
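
    The final join described above might look roughly like the sketch below; since the actual schema is not published with the dataset, every table and column name here is a hypothetical placeholder.

    -- Hypothetical sketch of the final join across the four tables.
    SELECT po.purchase_order_number,
           d.department_name,
           s.supplier_name,
           u.commodity_title,
           po.total_price
    FROM PurchaseOrder po
    JOIN Department d ON d.department_id = po.department_id
    JOIN Supplier   s ON s.supplier_id   = po.supplier_id
    JOIN UNSPSC     u ON u.unspsc_code   = po.unspsc_code;  -- only the first UNSPSC number per PO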

    Secondary/Related Resources:

  4. (Sunset) 📒 Meta Kaggle ported to MS SQL SERVER

    • kaggle.com
    Updated Mar 20, 2024
    Cite
    BwandoWando (2024). (Sunset) 📒 Meta Kaggle ported to MS SQL SERVER [Dataset]. http://doi.org/10.34740/kaggle/dsv/7896543
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. Also, I tend to write queries faster when using SQL MANAGEMENT STUDIO, like 100x faster. So, I ported Kaggle's Meta Kaggle dataset into MS SQL SERVER 2022 database format, created a backup file, then uploaded it here.

    • MSSQL VERSION: SQL Server 2022
    • Collation: SQL_Latin1_General_CP1_CI_AS
    • Recovery model: simple

    Requirements

    • Download and install the SQL SERVER 2022 Developer edition here
    • Download the backup file
    • Restore the backup file into your local instance. If you haven't done this before, it's easy and straightforward. Here is a guide, and a minimal T-SQL sketch follows this list.
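
    For reference, restoring the backup from T-SQL looks roughly like this. The file name, paths, and logical file names below are assumptions; check them with RESTORE FILELISTONLY, or use the SSMS restore wizard from the linked guide instead.

    -- Minimal restore sketch; paths and logical names are assumptions.
    RESTORE DATABASE MetaKaggle
    FROM DISK = N'C:\backups\MetaKaggle.bak'
    WITH MOVE N'MetaKaggle'     TO N'C:\data\MetaKaggle.mdf',
         MOVE N'MetaKaggle_log' TO N'C:\data\MetaKaggle_log.ldf',
         RECOVERY;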

    (QUOTED FROM THE ORIGINAL DATASET)

    Meta Kaggle

    Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.


    Notes

  5. Coursera - Data Science Fundamentals with Python and SQL

    • academictorrents.com
    bittorrent
    Updated May 2, 2021
    Cite
    None (2021). Coursera - Data Science Fundamentals with Python and SQL [Dataset]. https://academictorrents.com/details/bdc0bb1499b1992a5488b4bbcfc9288c30793c08
    Explore at:
    bittorrent (807986536 bytes); available download formats
    Dataset updated
    May 2, 2021
    Authors
    None
    License

    https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title 'Coursera - Data Science Fundamentals with Python and SQL'

  6. wikisql

    • huggingface.co
    Updated Aug 28, 2023
    + more versions
    Cite
    Salesforce (2023). wikisql [Dataset]. https://huggingface.co/datasets/Salesforce/wikisql
    Explore at:
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    https://choosealicense.com/licenses/unknown/

    Description

    A large crowd-sourced dataset for developing natural language interfaces for relational databases

  7. O*NET Database

    • onetcenter.org
    excel, mysql, oracle +2
    Updated May 20, 2025
    Cite
    National Center for O*NET Development (2025). O*NET Database [Dataset]. https://www.onetcenter.org/database.html
    Explore at:
    oracle, sql server, text, mysql, excel (available download formats)
    Dataset updated
    May 20, 2025
    Dataset provided by
    Occupational Information Network
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Dataset funded by
    US Department of Labor, Employment and Training Administration
    Description

    The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.

    Data content areas include:

    • Worker Characteristics (e.g., Abilities, Interests, Work Styles)
    • Worker Requirements (e.g., Education, Knowledge, Skills)
    • Experience Requirements (e.g., On-the-Job Training, Work Experience)
    • Occupational Requirements (e.g., Detailed Work Activities, Work Context)
    • Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)
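
    As a taste of the relational distribution, the sketch below lists task statements for matching occupations. The table and column names are taken from recent O*NET database releases (occupation_data, task_statements) but should be verified against the data dictionary for your version.

    -- Task statements for matching occupations (table names assumed from recent releases).
    SELECT o.title, t.task
    FROM occupation_data o
    JOIN task_statements t ON t.onetsoc_code = o.onetsoc_code
    WHERE o.title LIKE '%Database%';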

  8. SQL Bike Stores

    • kaggle.com
    Updated Nov 21, 2024
    Cite
    Mohamed ZRIRAK (2024). SQL Bike Stores [Dataset]. https://www.kaggle.com/datasets/mohamedzrirak/sql-bkestores
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohamed ZRIRAK
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Download: SQL query. This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.

    Key Objectives:

    • Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
    • Product and Category Insights: Evaluate product performance and each category's impact on overall sales.
    • Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
    • Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

    Techniques Used (see the sketch after the Use Cases list):

    • SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
    • Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
    • Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

    Use Cases:

    • Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
    • Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
    • Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
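
    A condensed sketch of the kind of query the project describes, using the table names listed above. The column names (quantity, list_price, discount, and the name fields) follow the widely used BikeStores sample schema and are assumptions; the string concatenation uses T-SQL's + operator.

    -- Units and revenue per store and sales representative (column names assumed).
    SELECT st.store_name,
           sf.first_name + ' ' + sf.last_name AS sales_rep,
           SUM(oi.quantity) AS total_units,
           SUM(oi.quantity * oi.list_price * (1 - oi.discount)) AS total_revenue
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN stores st      ON st.store_id = o.store_id
    JOIN staffs sf      ON sf.staff_id = o.staff_id
    GROUP BY st.store_name, sf.first_name, sf.last_name
    ORDER BY total_revenue DESC;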

  9. London Development Database SQL Extract

    • cloud.csiss.gmu.edu
    • data.europa.eu
    • +1 more
    bin, pdf
    Updated May 25, 2018
    + more versions
    Cite
    Greater London Authority (GLA) (2018). London Development Database SQL Extract [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/london-development-database-sql-extract
    Explore at:
    pdf, bin (available download formats)
    Dataset updated
    May 25, 2018
    Dataset provided by
    Greater London Authority (GLA)
    License

    http://reference.data.gov.uk/id/open-government-licence

    Area covered
    London
    Description

    This is a copy of the London Development Database.

    This is the entire LDD database exported as a .sql.tar using pg_dump. For information on how to use this file and details of the database tables, please refer to the document 'London Development database export.pdf'.

    The permissions data within this extract includes anything submitted to LDD by 23/05/2018. All data is provided by London’s planning authorities.

    An extract from the database can be downloaded from the London Datastore and data can be viewed on a map at https://maps.london.gov.uk/map/?ldd

  10. sql-create-context

    • huggingface.co
    • opendatalab.com
    Updated Apr 21, 2023
    + more versions
    Cite
    brianm (2023). sql-create-context [Dataset]. https://huggingface.co/datasets/b-mc2/sql-create-context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2023
    Authors
    brianm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and a SQL query answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent hallucination of column and table names often seen when training on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from different DBMSs and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
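
    A hypothetical record in this format (invented for illustration, not an actual row) pairs a CREATE TABLE context with a question and the answering query:

    -- context:
    CREATE TABLE head (age INTEGER, name VARCHAR);
    -- question: "How many heads of departments are older than 56?"
    -- answer:
    SELECT COUNT(*) FROM head WHERE age > 56;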

  11. A user-friendly extract of the LibGen scimag metadata SQL dump on 2017-04-07...

    • figshare.com
    application/x-rar
    Updated May 30, 2023
    Cite
    Daniel Himmelstein; Stephen McLaughlin (2023). A user-friendly extract of the LibGen scimag metadata SQL dump on 2017-04-07 [Dataset]. http://doi.org/10.6084/m9.figshare.5231245.v1
    Explore at:
    application/x-rar (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Daniel Himmelstein; Stephen McLaughlin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata for the LibGen scimag database of full-text scholarly documents. Each row of this dataset corresponds to a scholarly document in the LibGen scimag database, as identified by its DOI.

    scimag_dbbackup-2017-04-07.rar was downloaded from http://libgen.io/dbdumps/backup_archive/scimag_dbbackup-2017-04-07.rar. It is a compressed SQL dump of the LibGen scimag metadata database on 2017-04-07, unmodified as downloaded from libgen.io. It encodes a single table named scimag.

    libgen-scimag-2017-04-07.tsv.xz contains a TSV version of the scimag table from scimag_dbbackup-2017-04-07.rar. It is more user-friendly because it provides access to the data without requiring MySQL, is UTF-8 encoded, and has null bytes removed.

    The code that downloaded and processed these datasets is at https://git.io/v7Uh4. Users should note that the TimeAdded column appears to store the modification rather than the creation date for each DOI. As discussed in https://doi.org/b9s5, this field should not be mistaken for the date of first upload to LibGen scimag.

  12. Repackaged Full ITIS Data Set (MS SQL Server)

    • zenodo.org
    application/gzip, zip
    Updated Oct 20, 2021
    Cite
    Integrated Taxonomic Information System; Integrated Taxonomic Information System (2021). Repackaged Full ITIS Data Set (MS SQL Server) [Dataset]. http://doi.org/10.5281/zenodo.3833105
    Explore at:
    application/gzip, zip (available download formats)
    Dataset updated
    Oct 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Integrated Taxonomic Information System; Integrated Taxonomic Information System
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Retrieved 18 May 2020 from the Integrated Taxonomic Information System (ITIS) (http://www.itis.gov), via https://www.itis.gov/downloads/itisMSSql.zip.

    The archive itisMSSql.zip was unzipped, and repackaged as individual gzipped files. The original zip file is included in this data publication.

    Files in this publication:

    1. itisMSSql.zip - file downloaded from https://www.itis.gov/downloads/itisMSSql.zip on 18 May 2020

    2. Files ending with .gz (e.g., taxonomic_units.gz, synonym_links.gz): repackaged, gzipped content of itisMSSql.zip

  13. BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)...

    • paperswithcode.com
    Updated Jan 5, 2024
    + more versions
    Cite
    Jinyang Li; Binyuan Hui; Ge Qu; Jiaxi Yang; Binhua Li; Bowen Li; Bailin Wang; Bowen Qin; Rongyu Cao; Ruiying Geng; Nan Huo; Xuanhe Zhou; Chenhao Ma; Guoliang Li; Kevin C. C. Chang; Fei Huang; Reynold Cheng; Yongbin Li (2024). BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) Dataset [Dataset]. https://paperswithcode.com/dataset/bird-sql
    Explore at:
    Dataset updated
    Jan 5, 2024
    Authors
    Jinyang Li; Binyuan Hui; Ge Qu; Jiaxi Yang; Binhua Li; Bowen Li; Bailin Wang; Bowen Qin; Rongyu Cao; Ruiying Geng; Nan Huo; Xuanhe Zhou; Chenhao Ma; Guoliang Li; Kevin C. C. Chang; Fei Huang; Reynold Cheng; Yongbin Li
    Description

    BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.

  14. [udemy] SQL Masterclass for Financial Analysis & Financial Reporting

    • academictorrents.com
    bittorrent
    Updated Dec 26, 2024
    Cite
    Irfan Sharif (2024). [udemy] SQL Masterclass for Financial Analysis & Financial Reporting [Dataset]. https://academictorrents.com/details/161282939abe2462e37cd8a59664043716a1a529
    Explore at:
    bittorrent (735121934 bytes); available download formats
    Dataset updated
    Dec 26, 2024
    Dataset authored and provided by
    Irfan Sharif
    License

    https://academictorrents.com/nolicensespecified

    Description

    Official Course URL: udemy.com/course/sql-for-financial-data-analysis/

    Course Overview: Unlock the power of SQL for financial data analysis and reporting. This course is tailored for non-tech professionals who want to streamline their analytics and reporting capabilities. Learn to extract and process financial data, prepare detailed reports like Profit & Loss Statements and Balance Sheets, and calculate critical financial ratios through practical exercises.

    What You'll Learn:
    - SQL Basics: Master database querying techniques for financial data.
    - Report Preparation: Create Profit & Loss Statements, Balance Sheets, and Cash Flow Statements.
    - Key Analytics: Calculate and interpret profitability, efficiency, and liquidity ratios.
    - Database Skills: Gain hands-on experience without prior technical expertise.

    Course Benefits:
    - Practical Applications: Apply SQL to real-world financial scenarios.
    - Independent Reporting: Reduce reliance on system-generated reports.
    - Career Advancem
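
    For flavor, here is a toy version of the kind of report the course covers: rolling a hypothetical general_ledger table up into a one-line profit and loss summary. Every identifier here is invented for illustration and is not from the course materials.

    -- Toy P&L rollup over a hypothetical general_ledger table.
    SELECT SUM(CASE WHEN account_type = 'Revenue' THEN amount ELSE 0 END) AS total_revenue,
           SUM(CASE WHEN account_type = 'Expense' THEN amount ELSE 0 END) AS total_expenses,
           SUM(CASE WHEN account_type = 'Revenue' THEN amount ELSE -amount END) AS net_income
    FROM general_ledger
    WHERE fiscal_year = 2024;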

  15. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    • 2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table
    • 2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples
    • replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, and the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D

  16. Playlist2vec: Spotify Million Playlist Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 22, 2021
    Cite
    Piyush Papreja; Piyush Papreja (2021). Playlist2vec: Spotify Million Playlist Dataset [Dataset]. http://doi.org/10.5281/zenodo.5002584
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Piyush Papreja; Piyush Papreja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was created using the Spotify developer API. It consists of user-created as well as Spotify-curated playlists.
    The dataset consists of 1 million playlists, 3 million unique tracks, 3 million unique albums, and 1.3 million artists.
    The data is stored in a SQL database, with the primary entities being songs, albums, artists, and playlists.
    Each of the aforementioned entities is represented by a unique ID (Spotify URI).
    Data is stored in the following tables:

    • album
    • artist
    • track
    • playlist
    • track_artist1
    • track_playlist1

    album

    | id | name | uri |

    id: Album ID as provided by Spotify
    name: Album Name as provided by Spotify
    uri: Album URI as provided by Spotify


    artist

    | id | name | uri |

    id: Artist ID as provided by Spotify
    name: Artist Name as provided by Spotify
    uri: Artist URI as provided by Spotify


    track

    | id | name | duration | popularity | explicit | preview_url | uri | album_id |

    id: Track ID as provided by Spotify
    name: Track Name as provided by Spotify
    duration: Track Duration (in milliseconds) as provided by Spotify
    popularity: Track Popularity as provided by Spotify
    explicit: Whether the track has explicit lyrics or not. (true or false)
    preview_url: A link to a 30 second preview (MP3 format) of the track. Can be null
    uri: Track Uri as provided by Spotify
    album_id: Album Id to which the track belongs


    playlist

    | id | name | followers | uri | total_tracks |

    id: Playlist ID as provided by Spotify
    name: Playlist Name as provided by Spotify
    followers: Playlist Followers as provided by Spotify
    uri: Playlist Uri as provided by Spotify
    total_tracks: Total number of tracks in the playlist.

    track_artist1

    | track_id | artist_id |

    Track-Artist association table

    track_playlist1

    | track_id | playlist_id |

    Track-Playlist association table
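
    Given the documented tables, a typical query joins an association table back to its entities; for example, the ten most-followed playlists with their track counts. The identifiers are exactly those documented above; the LIMIT syntax assumes a MySQL- or PostgreSQL-style engine, which the dump does not specify.

    -- Ten most-followed playlists and their track counts.
    SELECT p.name, p.followers, COUNT(tp.track_id) AS n_tracks
    FROM playlist p
    JOIN track_playlist1 tp ON tp.playlist_id = p.id
    GROUP BY p.id, p.name, p.followers
    ORDER BY p.followers DESC
    LIMIT 10;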

    - - - - - SETUP - - - - -


    The data is in the form of a SQL dump. The download size is about 10 GB, and the database populated from it comes out to about 35GB.

    spotifydbdumpschemashare.sql contains the schema for the database (for reference).
    spotifydbdumpshare.sql is the actual data dump.


    Setup steps:
    1. Create database

    - - - - - PAPER - - - - -


    The description of this dataset can be found in the following paper:

    Papreja P., Venkateswara H., Panchanathan S. (2020) Representation, Exploration and Recommendation of Playlists. In: Cellier P., Driessens K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Communications in Computer and Information Science, vol 1168. Springer, Cham

  17. Geodatabase for the Baltimore Ecosystem Study Spatial Data

    • portal.edirepository.org
    • search.dataone.org
    application/vnd.rar
    Updated May 4, 2012
    Cite
    Jarlath O'Neal-Dunne; Morgan Grove (2012). Geodatabase for the Baltimore Ecosystem Study Spatial Data [Dataset]. http://doi.org/10.6073/pasta/377da686246f06554f7e517de596cd2b
    Explore at:
    application/vnd.rar (29574980 kilobytes); available download formats
    Dataset updated
    May 4, 2012
    Dataset provided by
    EDI
    Authors
    Jarlath O'Neal-Dunne; Morgan Grove
    Time period covered
    Jan 1, 1999 - Jun 1, 2014
    Area covered
    Description

    The establishment of a BES Multi-User Geodatabase (BES-MUG) allows for the storage, management, and distribution of geospatial data associated with the Baltimore Ecosystem Study. At present, BES data is distributed over the internet via the BES website. While having geospatial data available for download is a vast improvement over having the data housed at individual research institutions, it still suffers from some limitations. BES-MUG overcomes these limitations, improving the quality of the geospatial data available to BES researchers, thereby leading to more informed decision-making.

       BES-MUG builds on Environmental Systems Research Institute's (ESRI) ArcGIS and ArcSDE technology. ESRI was selected because its geospatial software offers robust capabilities. ArcGIS is implemented agency-wide within the USDA and is the predominant geospatial software package used by collaborating institutions.
    
    
    Commercially available enterprise database packages (DB2, Oracle, SQL Server) provide an efficient means to store, manage, and share large datasets. However, standard database capabilities are limited with respect to geographic datasets because they lack the ability to deal with complex spatial relationships. By using ESRI's ArcSDE (Spatial Database Engine) in conjunction with database software, geospatial data can be handled much more effectively through the implementation of the Geodatabase model. Through ArcSDE and the Geodatabase model the database's capabilities are expanded, allowing for multiuser editing, intelligent feature types, and the establishment of rules and relationships. ArcSDE also allows users to connect to the database using ArcGIS software without being burdened by the intricacies of the database itself.
    
    
    For an example of how BES-MUG will help improve the quality and timeliness of BES geospatial data, consider a census block group layer that is in need of updating. Rather than the researcher downloading the dataset, editing it, and resubmitting it through ORS, access rules will allow the authorized user to edit the dataset over the network. Established rules will ensure that attribute and topological integrity is maintained, so that key fields are not left blank and that the block group boundaries stay within tract boundaries. Metadata will automatically be updated, showing who edited the dataset and when, in the event any questions arise.
    
    
    Currently, a functioning prototype Multi-User Database has been developed for BES at the University of Vermont Spatial Analysis Lab, using ArcSDE and IBM's DB2 Enterprise Database as a back-end architecture. This database, which is currently only accessible to those on the UVM campus network, will shortly be migrated to a Linux server where it will be accessible for database connections over the Internet. Passwords can then be handed out to all interested researchers on the project, who will be able to make a database connection through the Geographic Information Systems software interface on their desktop computer.
    
    
       This database will include a very large number of thematic layers. Those layers are currently divided into biophysical, socio-economic and imagery categories. Biophysical includes data on topography, soils, forest cover, habitat areas, hydrology and toxics. Socio-economics includes political and administrative boundaries, transportation and infrastructure networks, property data, census data, household survey data, parks, protected areas, land use/land cover, zoning, public health and historic land use change. Imagery includes a variety of aerial and satellite imagery.
    
    
       See the readme: http://96.56.36.108/geodatabase_SAL/readme.txt
    
    
       See the file listing: http://96.56.36.108/geodatabase_SAL/diroutput.txt
    
  18. spider

    • huggingface.co
    • opendatalab.com
    Updated Dec 9, 2021
    + more versions
    Cite
    XLang NLP Lab (2021). spider [Dataset]. https://huggingface.co/datasets/xlangai/spider
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset authored and provided by
    XLang NLP Lab
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Spider

      Dataset Summary
    

    Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
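
    Unlike single-table benchmarks such as WikiSQL, Spider questions routinely require joins, grouping, and nesting over multi-table schemas. A hypothetical Spider-style pair (invented for illustration, not taken from the dataset):

    -- Question: "For each country, how many singers are over the age of 40?"
    SELECT country, COUNT(*) AS n_singers
    FROM singer
    WHERE age > 40
    GROUP BY country;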

      Supported Tasks and Leaderboards
    

    The leaderboard can be seen at https://yale-lily.github.io/spider

      Languages
    

    The text in the dataset is in English.

      Dataset Structure

    Data… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/spider.
    
  19. Health and Retirement Study (HRS)

    • search.dataone.org
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    • 1992 - 2010 download HRS microdata.R: loop through every year and every file, download, then unzip everything in one big party
    • import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk; load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
    • longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create two database-backed complex sample survey objects, using a taylor-series linearization design; perform a mountain of analysis examples with wave weights from two different points in the panel
    • import example HRS file.R: load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html); parse through the IF block at the bottom of the sas importation script, blank out a number of variables; save the file as an R data file (.rda) for fast loading later
    • replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program; create a database-backed complex sample survey object, using a taylor-series linearization design; exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs.

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  20. ASSEMBLY OF FRANCE METROPOLITAN OPEN STREET MAP: GEOPACKAGE AND SQL FORMAT

    • data.europa.eu
    plain text, zip
    Updated Aug 22, 2023
    Cite
    DELETED DELETED (2023). ASSEMBLY OF FRANCE METROPOLITAN OPEN STREET MAP: GEOPACKAGE AND SQL FORMAT [Dataset]. https://data.europa.eu/data/datasets/60c46d63ec3bdcb9d526c776?locale=en
    Explore at:
    zip (1197185836), zip, zip (300551415), plain text (433); available download formats
    Dataset updated
    Aug 22, 2023
    Dataset authored and provided by
    DELETED DELETED
    Area covered
    Metropolitan France, France
    Description

    Here you will find an assembly of OpenStreetMap data for metropolitan France. The GeoPackage version also contains data from neighbouring countries (border regions, except Spain). The .qgz project allows the GeoPackage data to be opened with the style already loaded, switching according to the zoom level. A video presenting this GPKG data and QGIS: https://www.youtube.com/watch?v=R6O9cMqVVvM&t=6s. The .sql version is characterised by an additional attribute on each geometric entity: the INSEE code. This data will be updated on a monthly basis.
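
    Once the .sql version is loaded into a database, that INSEE code attribute can be used to filter features by commune. A sketch, with a hypothetical layer name since the actual table layout is not described here:

    -- Filter features by INSEE commune code (table and column names hypothetical).
    SELECT *
    FROM osm_buildings
    WHERE insee_code = '75056';  -- 75056 is the INSEE code for Paris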

    INSTRUCTIONS FOR UNPACKING THE GPKG DATA: Download all files and rename them as follows:

    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.001
    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_002.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.002
    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_003.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.003
    OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_004.zip -> OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.004

    or, if you know batch scripting, create a .bat file containing the following (or rename the provided rename.txt file to rename.bat):

    pushd "%~dp0"
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.001
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_002.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.002
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_003.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.003
    REN OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_004.zip OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214.zip.004

    then launch the .bat file by double-clicking on it (the batch file must be in the same folder as the zip files)

    Then right-click on the OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001.zip file and extract it to "OSM_QGZ_GPKG_ET_FRONTALIER_PRDG_FXX_ED214_001\" with your decompression software. There is no need to click on 002, 003, 004: opening file .001 opens all the other parts of the archive.

    For the .sql version, the procedure is the same. Rename:
    OSM_SQL_FXX_PRDG_D000_ED214_001.zip -> OSM_SQL_FXX_PRDG_D000_ED214.zip.001
    OSM_SQL_FXX_PRDG_D000_ED214_002.zip -> OSM_SQL_FXX_PRDG_D000_ED214.zip.002
    OSM_SQL_FXX_PRDG_D000_ED214_003.zip -> OSM_SQL_FXX_PRDG_D000_ED214.zip.003
    Then carry out the decompression.
