72 datasets found
  1. SQL analysis using pizass data set

    • kaggle.com
    zip
    Updated Jul 13, 2024
    Cite
    Michael_Dsouza16 (2024). SQL analysis using pizass data set [Dataset]. https://www.kaggle.com/datasets/michaeldsouza16/sql-analysis-using-pizass-data-set
    Available download formats: zip (427330 bytes)
    Dataset updated
    Jul 13, 2024
    Authors
    Michael_Dsouza16
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is designed for SQL analysis exercises, providing comprehensive data on pizza sales, orders, and customer preferences. It includes details on order quantities, pizza types, and the composition of various pizzas. The dataset is ideal for practicing SQL queries, performing revenue analysis, and understanding customer behavior in the pizza industry.

    1. order_details.csv — Contains details of each pizza order. Columns:
    order_details_id: Unique identifier for the order detail.
    order_id: Identifier for the order.
    pizza_id: Identifier for the pizza type.
    quantity: Number of pizzas ordered.

    2. pizza_types.csv — Provides information on the different types of pizzas available. Columns:
    pizza_type_id: Unique identifier for the pizza type.
    name: Name of the pizza.
    category: Category of the pizza (e.g., Chicken, Vegetarian).
    ingredients: List of ingredients used in the pizza.

    3. Questions.txt — Contains various SQL questions for analyzing the dataset. Basic examples:
    Retrieve the total number of orders placed.
    Calculate the total revenue generated from pizza sales.
    Identify the highest-priced pizza.
    Identify the most common pizza size ordered.
    List the top 5 most ordered pizza types along with their quantities.
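
    As a worked sketch of the first two questions above: the pricing questions imply a pizzas table (pizza_id, pizza_type_id, size, price) linking order lines to prices, which is an assumption about the archive's contents rather than something described above.

        -- total number of orders placed
        SELECT COUNT(DISTINCT order_id) AS total_orders
        FROM order_details;

        -- total revenue generated from pizza sales
        -- (pizzas is an assumed table: pizza_id, pizza_type_id, size, price)
        SELECT ROUND(SUM(od.quantity * p.price), 2) AS total_revenue
        FROM order_details od
        JOIN pizzas p ON p.pizza_id = od.pizza_id;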

  2. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain-text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain-text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

        python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
        python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

        python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data-cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can easily be collated into a SQLite3 database, which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
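
    To give a flavor of querying the resulting database, here is a minimal sketch using the column layout documented above; the table name ngrams_isograms is an assumption, since the actual table names are defined by create-database.sql.

        -- ten most frequent second-order isograms that are also palindromes
        -- (ngrams_isograms is an assumed table name)
        SELECT word, length, count_per_million
        FROM ngrams_isograms
        WHERE isogramy = 2
          AND is_palindrome = 1
        ORDER BY count_per_million DESC
        LIMIT 10;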

  3. Current Population Survey (CPS)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file 2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
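
    as a minimal sketch of the 'create a new variable - one' step mentioned above (a constant column is handy for record counts in survey packages), assuming the person-level table in the sqlite database is named asec12 - the real table name is set by the download script:

        -- add a constant column named 'one' ('asec12' is an assumed table name)
        ALTER TABLE asec12 ADD COLUMN one INTEGER NOT NULL DEFAULT 1;

        -- every person record now counts once
        SELECT SUM(one) AS person_records FROM asec12;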

  4. eCommerce Transactions

    • kaggle.com
    zip
    Updated Jan 3, 2025
    Cite
    Chad Wambles (2025). eCommerce Transactions [Dataset]. https://www.kaggle.com/datasets/chadwambles/ecommerce-transactions
    Available download formats: zip (245430 bytes)
    Dataset updated
    Jan 3, 2025
    Authors
    Chad Wambles
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data set is perfect for practicing your analytical skills in Power BI, Tableau, or Excel, or for converting it to CSV to practice SQL.

    This use case mimics transactions for a fictional eCommerce website named EverMart Online. The three tables in this data set are logically connected by IDs.

    My Power BI Use Case Explanation - Using Microsoft Power BI, I made dynamic data visualizations for revenue reporting and customer behavior reporting.

    Revenue Reporting Visuals:
    • Data Card Visual that dynamically shows Total Products Listed, Total Unique Customers, Total Transactions, and Total Revenue by Total Sales, Product Sales, or Categorical Sales.
    • Line Graph Visual that shows Total Revenue by Month across the entire year. This graph also recalculates Total Revenue by Month for Total Sales by Product and Total Sales by Category if selected.
    • Bar Graph Visual showcasing Total Sales by Product.
    • Donut Chart Visual showcasing Total Sales by Category of Product.

    Customer Behavior Reporting Visuals:
    • Data Card Visual that dynamically shows Total Products Listed, Total Unique Customers, Total Transactions, and Total Revenue in total or by the continent selected on the map.
    • Interactive Map Visual showing key statistics for the selected continent. The key statistics are presented in the tooltip when you select a continent: Continent Name, Customer Total, Percentage of Products Sold, Percentage of Total Customers, Percentage of Total Transactions, and Percentage of Total Revenue.
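
    For the SQL-practice angle, here is a minimal sketch of a revenue-by-month query. The schema is hypothetical (the actual table and column names are whatever the three CSVs define): transactions(transaction_id, product_id, quantity, transaction_date) and products(product_id, product_name, category, price); strftime assumes the SQLite dialect.

        -- total revenue by month (SQLite dialect; all schema names are assumptions)
        SELECT strftime('%Y-%m', t.transaction_date) AS month,
               SUM(t.quantity * p.price) AS total_revenue
        FROM transactions t
        JOIN products p ON p.product_id = t.product_id
        GROUP BY month
        ORDER BY month;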

  5. Coronavirus Panoply.io for Database Warehousing and Post Analysis using...

    • data.mendeley.com
    Updated Feb 4, 2020
    Cite
    Pranav Pandya (2020). Coronavirus Panoply.io for Database Warehousing and Post Analysis using Sequal Language (SQL) [Dataset]. http://doi.org/10.17632/4gphfg5tgs.2
    Dataset updated
    Feb 4, 2020
    Authors
    Pranav Pandya
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It has never been easier to solve any database-related problem using SQL, and the following gives you an opportunity to understand how I was able to figure out some of the interrelationships between the tables using the Panoply.io tool.

    I was able to insert the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a Data Warehouse environment.

    The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

    Query 1

        SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed
        FROM "public"."coronavirus_updated"
        WHERE Recovered > (Deaths / 2) AND Deaths > 0

    Description: How can we estimate where the coronavirus has infiltrated but recovery amongst patients is effective? We can view those places by selecting regions where recoveries exceed half the death toll.

    Query 2

        SELECT country,
               SUM(confirmed) AS "Confirmed Count",
               SUM(Recovered) AS "Recovered Count",
               SUM(Deaths) AS "Death Toll"
        FROM "public"."coronavirus_updated"
        WHERE Recovered > (Deaths / 2) AND Confirmed > 0
        GROUP BY country

    Description: Per-country totals of confirmed, recovered, and death counts, restricted to countries with confirmed cases and effective recovery.

    Query 3

        SELECT country AS "Countries where Coronavirus has reached"
        FROM "public"."coronavirus_updated"
        WHERE confirmed > 0
        GROUP BY country

    Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing which countries have confirmed cases. Here is a list of those countries.

    Query 4

        SELECT country,
               SUM(suspected) AS "Suspected Cases under potential CoronaVirus outbreak"
        FROM "public"."coronavirus_updated"
        WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
        GROUP BY country
        ORDER BY SUM(suspected) DESC

    Description: The coronavirus is spreading at an alarming rate. It is important to know which countries are newly getting the virus, because if timely measures are taken there, casualties could be prevented. Here is a list of suspected cases in countries with no virus-related deaths.

    Query 5

        SELECT country,
               SUM(suspected) AS "Coronavirus uncontrolled spread count and human life loss",
               100 * SUM(suspected) / (SELECT SUM(suspected) FROM "public"."coronavirus_updated") AS "Global suspected Exposure of Coronavirus in percentage"
        FROM "public"."coronavirus_updated"
        WHERE suspected > 0 AND deaths = 0
        GROUP BY country
        ORDER BY SUM(suspected) DESC

    Description: The coronavirus is getting stronger in particular countries, but how do we measure that? By the percentage of the world's suspected cases found in each country that does not yet have any coronavirus-related deaths. The following is a list.

    Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India

  6. Statewide Commercial Baseline Study of New York Penetration and Saturation...

    • splitgraph.com
    • data.ny.gov
    Updated Jul 1, 2024
    Cite
    New York State Energy Research and Development Authority (NYSERDA) (2024). Statewide Commercial Baseline Study of New York Penetration and Saturation of Energy Using Equipment: 2019 [Dataset]. https://www.splitgraph.com/ny-gov/statewide-commercial-baseline-study-of-new-york-umaq-yp6d
    Available download formats: application/openapi+json, json, application/vnd.splitgraph.image
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    New York State Energy Research and Development Authority (https://www.nyserda.ny.gov/)
    Authors
    New York State Energy Research and Development Authority (NYSERDA)
    Area covered
    New York
    Description

    This dataset includes all Statewide Commercial Baseline Study summary statistics related to the estimation of population penetration and saturation estimates. These include summaries of the number of survey respondents asked each question, the number of survey respondents who provided a valid answer, the unweighted penetration, weighted penetration, and adjusted and weighted penetration. All supporting summary statistics are also provided. Penetration refers to the proportion of businesses that have one or more of a particular piece of equipment. Saturation is a number representing how many of a particular piece of equipment are present, on average, among all businesses. The overall objective of the Statewide Commercial Baseline research was to understand the existing commercial building stock in New York State and associated energy use, including the penetration and saturation of energy-consuming equipment (electric, natural gas, and other fuels). For more information, see the Final Report at https://www.nyserda.ny.gov/About/Publications/Building-Stock-and-Potential-Studies/Commercial-Statewide-Baseline-Study.

    NYSERDA offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, accelerate economic growth, and reduce reliance on fossil fuels. To learn more about NYSERDA’s programs, visit nyserda.ny.gov or follow us on X, Facebook, YouTube, or Instagram.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
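
    A minimal sketch of such a query, using Splitgraph's "namespace/repository"."table" addressing; the repository path follows the citation URL above, while the table name baseline_study is an assumption:

        -- count the rows in the dataset (the table name is an assumption)
        SELECT COUNT(*) AS row_count
        FROM "ny-gov/statewide-commercial-baseline-study-of-new-york-umaq-yp6d"."baseline_study";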

    See the Splitgraph documentation for more information.

  7. City-Level Descriptive Statistics for GHG Inventory

    • splitgraph.com
    • data.kcmo.org
    Updated Dec 28, 2023
    Cite
    Jerry Shechter (2023). City-Level Descriptive Statistics for GHG Inventory [Dataset]. https://www.splitgraph.com/kcmo/citylevel-descriptive-statistics-for-ghg-inventory-u9uw-758m
    Available download formats: json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Dec 28, 2023
    Dataset authored and provided by
    Jerry Shechter
    Description

    This data set contains community statistics that were used to calculate greenhouse gas (GHG) emissions for the purposes of the 2013 GHG inventory.

    Data sources include the US Census Bureau, Mid-America Regional Council (MARC), Jackson County Assessor's Office, KCP&L electric company, Missouri Gas/Laclede gas company, the Federal Highway Administration Office of Highway Policy Information Highway Statistics Series, Climate Action and Climate Protection Software notes, Kansas City Area Transit Authority (KCATA), the EPA large-emitter website (http://ghgdata.epa.gov), and the City of Kansas City Public Works and Water Services Departments.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  8. SRP All COMPASS GW Site Summary in New Jersey

    • share-open-data-njtpa.hub.arcgis.com
    • njogis-newjersey.opendata.arcgis.com
    Updated Jul 8, 2025
    Cite
    NJDEP Bureau of GIS (2025). SRP All COMPASS GW Site Summary in New Jersey [Dataset]. https://share-open-data-njtpa.hub.arcgis.com/datasets/njdep::srp-all-compass-gw-site-summary-in-new-jersey
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    NJDEP Bureau of GIS
    Description

    This GIS layer is based on a SQL query of the groundwater HAZSITE data that resides in COMPASS for each active Site Remediation case. Once the raw groundwater HAZSITE data is extracted from COMPASS, it is summarized such that a maximum concentration for each contaminant is derived for the year preceding the last sampling event (samp_last_max_conc), and a maximum concentration is also generated across all sampling events (all_max_conc). Each active Site Remediation case is included in the GIS layer. For the HAZSITE data, there are a number of considerations that need to be taken into account when using this GIS layer for decision-making purposes:
    • Not all SRP cases have provided HAZSITE data to the Department, or HAZSITE data that has been provided to the Department may be incomplete;
    • Additional sampling may have been conducted since the last round of HAZSITE data was submitted that has not yet been provided, as HAZSITE data is only required with key document submittals;
    • HAZSITE data that was submitted may not have been provided in the correct format and therefore could not be uploaded into the COMPASS data repository, and would therefore not be returned via the COMPASS SQL query.
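
    A minimal sketch of the kind of summary described above; the table and column names are assumptions (the real COMPASS schema is not published here), and the date arithmetic assumes the SQLite dialect:

        -- assumed table: hazsite_results(site_id, contaminant, sample_date, concentration)

        -- all-time maximum concentration per site and contaminant (cf. all_max_conc)
        SELECT site_id, contaminant, MAX(concentration) AS all_max_conc
        FROM hazsite_results
        GROUP BY site_id, contaminant;

        -- maximum within the year preceding each site's last sampling event (cf. samp_last_max_conc)
        SELECT r.site_id, r.contaminant, MAX(r.concentration) AS samp_last_max_conc
        FROM hazsite_results r
        JOIN (SELECT site_id, MAX(sample_date) AS last_date
              FROM hazsite_results
              GROUP BY site_id) l ON l.site_id = r.site_id
        WHERE r.sample_date >= date(l.last_date, '-1 year')
        GROUP BY r.site_id, r.contaminant;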

  9. NHIS Adult Summary Health Statistics

    • splitgraph.com
    • healthdata.gov
    Updated Jul 22, 2024
    Cite
    NCHS/DHIS (2024). NHIS Adult Summary Health Statistics [Dataset]. https://www.splitgraph.com/cdc-gov/nhis-adult-summary-health-statistics-25m4-6qqq
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    NCHS/DHIS
    Description

    Interactive Summary Health Statistics for Adults provide annual estimates of selected health topics for adults aged 18 years and over based on final data from the National Health Interview Survey.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  10. Independent dispute resolution summary data

    • splitgraph.com
    • data.texas.gov
    Updated Oct 15, 2024
    Cite
    texas-gov (2024). Independent dispute resolution summary data [Dataset]. https://www.splitgraph.com/texas-gov/independent-dispute-resolution-summary-data-bn27-65ad
    Available download formats: application/vnd.splitgraph.image, application/openapi+json, json
    Dataset updated
    Oct 15, 2024
    Authors
    texas-gov
    Description

    The Texas Department of Insurance administers Independent Dispute Resolution (IDR), a mediation and arbitration process for certain health care billing disputes between out-of-network providers and health plans. Mediation is used for billing disputes between out-of-network facilities and health plans. Arbitration is used for billing disputes between out-of-network health care providers (not facilities) and health plans. Medical services or supplies received on or after January 1, 2020 may be eligible for IDR. To learn more, go to the TDI webpage, Mediation and arbitration of medical bills.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  11. (Sunset)📒 Meta Kaggle ported to MS SQL SERVER

    • kaggle.com
    zip
    Updated Mar 20, 2024
    Cite
    BwandoWando (2024). (Sunset)📒 Meta Kaggle ported to MS SQL SERVER [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-ported-to-sql-server-2022-database
    Available download formats: zip (8635902534 bytes)
    Dataset updated
    Mar 20, 2024
    Authors
    BwandoWando
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. Also, I tend to write queries faster when using SQL Server Management Studio, like 100x faster. So, I ported Kaggle's Meta Kaggle dataset into MS SQL Server 2022 database format, created a backup file, then uploaded it here.

    • MSSQL VERSION: SQL Server 2022
    • Collation: SQL_Latin1_General_CP1_CI_AS
    • Recovery model: simple

    Requirements

    • Download and install the SQL SERVER 2022 Developer edition here
    • Download the backup file
    • Restore the backup file into your local instance. If you haven't done this before, it's easy and straightforward. Here is a guide.
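
    A minimal T-SQL sketch of the restore; the backup path and the logical file names are assumptions, so list the real ones with RESTORE FILELISTONLY first:

        -- inspect the logical file names inside the backup
        RESTORE FILELISTONLY FROM DISK = N'C:\Backups\MetaKaggle.bak';

        -- restore, moving data and log files to local directories
        -- (logical names 'MetaKaggle' and 'MetaKaggle_log' are assumptions)
        RESTORE DATABASE MetaKaggle
        FROM DISK = N'C:\Backups\MetaKaggle.bak'
        WITH MOVE N'MetaKaggle'     TO N'C:\Data\MetaKaggle.mdf',
             MOVE N'MetaKaggle_log' TO N'C:\Data\MetaKaggle_log.ldf',
             RECOVERY;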

    (QUOTED FROM THE ORIGINAL DATASET)

    Meta Kaggle

    Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.



  12. Steam Dataset 2025: Multi-Modal Gaming Analytics

    • kaggle.com
    zip
    Updated Oct 7, 2025
    Cite
    CrainBramp (2025). Steam Dataset 2025: Multi-Modal Gaming Analytics [Dataset]. https://www.kaggle.com/datasets/crainbramp/steam-dataset-2025-multi-modal-gaming-analytics
    Available download formats: zip (12478964226 bytes)
    Dataset updated
    Oct 7, 2025
    Authors
    CrainBramp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Steam Dataset 2025: Multi-Modal Gaming Analytics Platform

    The first multi-modal Steam dataset with semantic search capabilities. 239,664 applications collected from official Steam Web APIs with PostgreSQL database architecture, vector embeddings for content discovery, and comprehensive review analytics.

    Made by a lifelong gamer for the gamer in all of us. Enjoy!🎮

    GitHub Repository https://github.com/vintagedon/steam-dataset-2025

    [Figure: 1024-dimensional game embeddings projected to 2D via UMAP reveal natural genre clustering in semantic space]

    What Makes This Different

    Unlike traditional flat-file Steam datasets, this is built as an analytically-native database optimized for advanced data science workflows:

    ☑️ Semantic Search Ready - 1024-dimensional BGE-M3 embeddings enable content-based game discovery beyond keyword matching (see the query sketch after this list)

    ☑️ Multi-Modal Architecture - PostgreSQL + JSONB + pgvector in unified database structure

    ☑️ Production Scale - 239K applications vs typical 6K-27K in existing datasets

    ☑️ Complete Review Corpus - 1,048,148 user reviews with sentiment and metadata

    ☑️ 28-Year Coverage - Platform evolution from 1997-2025

    ☑️ Publisher Networks - Developer and publisher relationship data for graph analysis

    ☑️ Complete Methodology & Infrastructure - Full work logs document every technical decision and challenge encountered, while my API collection scripts, database schemas, and processing pipelines enable you to update the dataset, fork it for customized analysis, learn from real-world data engineering workflows, or critique and improve the methodology
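
    As a minimal sketch of the semantic-search idea flagged above, the query below assumes a pgvector-enabled PostgreSQL restore of the dump with a hypothetical table games(appid bigint, name text, embedding vector(1024)); the real table and column names are set by the dataset's schema documentation.

        -- ten games most similar to a chosen title, by cosine distance (pgvector's <=> operator)
        -- 'games' and its columns are assumed names
        SELECT g2.appid, g2.name,
               g1.embedding <=> g2.embedding AS cosine_distance
        FROM games g1
        JOIN games g2 ON g2.appid <> g1.appid
        WHERE g1.name = 'Hollow Knight'   -- any query title works here
        ORDER BY g1.embedding <=> g2.embedding
        LIMIT 10;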

    [Figure: Market segmentation and pricing strategy analysis across top 10 genres]

    What's Included

    Core Data (CSV Exports):
    • 239,664 Steam applications with complete metadata
    • 1,048,148 user reviews with scores and statistics
    • 13 normalized relational tables for pandas/SQL workflows
    • Genre classifications, pricing history, platform support
    • Hardware requirements (min/recommended specs)
    • Developer and publisher portfolios

    Advanced Features (PostgreSQL):
    • Full database dump with optimized indexes
    • JSONB storage preserving complete API responses
    • Materialized columns for sub-second query performance
    • Vector embeddings table (pgvector-ready)

    Documentation:
    • Complete data dictionary with field specifications
    • Database schema documentation
    • Collection methodology and validation reports

    Example Analysis: Published Notebooks (v1.0)

    Three comprehensive analysis notebooks demonstrate dataset capabilities. All notebooks render directly on GitHub with full visualizations and output:

    📊 Platform Evolution & Market Landscape

    View on GitHub | PDF Export
    28 years of Steam's growth, genre evolution, and pricing strategies.

    🔍 Semantic Game Discovery

    View on GitHub | PDF Export
    Content-based recommendations using vector embeddings across genre boundaries.

    🎯 The Semantic Fingerprint

    View on GitHub | PDF Export
    Genre prediction from game descriptions - demonstrates text analysis capabilities.

    Notebooks render with full output on GitHub. Kaggle-native versions planned for v1.1 release. CSV data exports included in dataset for immediate analysis.

    [Figure: Steam platfor...]

  13. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research. if you apply for an interviewer job with them, i hope you like werther's original.

    figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle.

    but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

    the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    1992 - 2010 download HRS microdata.R
    • loop through every year and every file, download, then unzip everything in one big party

    import longitudinal RAND contributed files.R
    • create a SQLite database (.db) on the local disk
    • load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

    longitudinal RAND - analysis examples.R
    • connect to the sql database created by the 'import longitudinal RAND contributed files' program
    • create two database-backed complex sample survey objects, using a taylor-series linearization design
    • perform a mountain of analysis examples with wave weights from two different points in the panel

    import example HRS file.R
    • load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
    • parse through the IF block at the bottom of the sas importation script, blank out a number of variables
    • save the file as an R data file (.rda) for fast loading later

    replicate 2002 regression.R
    • connect to the sql database created by the 'import longitudinal RAND contributed files' program
    • create a database-backed complex sample survey object, using a taylor-series linearization design
    • exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts

    for more detail about the health and retirement study (hrs), visit:
    • michigan's hrs homepage
    • rand's hrs homepage
    • the hrs wikipedia page
    • a running list of publications using hrs

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

    confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  14. Census Demographics

    • splitgraph.com
    • data.brla.gov
    Updated Dec 15, 2021
    Cite
    Information Services (2021). Census Demographics [Dataset]. https://www.splitgraph.com/brla-gov/census-demographics-xsrb-mxqt/
    Available download formats: json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Dec 15, 2021
    Dataset authored and provided by
    Information Services
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Summary statistics from the 2000 and 2010 United States Census including population, demographics, education, and housing information for each block group in East Baton Rouge Parish, Louisiana.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  15. Statewide Commercial Baseline Study of New York Means of Energy Using...

    • splitgraph.com
    • data.ny.gov
    Updated Jul 1, 2024
    Cite
    New York State Energy Research and Development Authority (NYSERDA) (2024). Statewide Commercial Baseline Study of New York Means of Energy Using Equipment: 2019 [Dataset]. https://www.splitgraph.com/ny-gov/statewide-commercial-baseline-study-of-new-york-ttu3-cutd
    Available download formats: application/vnd.splitgraph.image, application/openapi+json, json
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    New York State Energy Research and Development Authority (https://www.nyserda.ny.gov/)
    Authors
    New York State Energy Research and Development Authority (NYSERDA)
    Area covered
    New York
    Description

    The overall objective of the Statewide Commercial Baseline research was to understand the existing commercial building stock in New York State and associated energy use, including the means of energy-using equipment characteristics. This dataset provides all characteristics that are presented as averages, such as the average square footage of businesses or the average cooling capacity of split systems. All supporting summary statistics are also provided. For more information, see the Final Report at https://www.nyserda.ny.gov/About/Publications/Building-Stock-and-Potential-Studies/Commercial-Statewide-Baseline-Study.

    NYSERDA offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, accelerate economic growth, and reduce reliance on fossil fuels. To learn more about NYSERDA’s programs, visit nyserda.ny.gov or follow us on X, Facebook, YouTube, or Instagram.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  16. NHIS Adult 3-Year Summary Health Statistics

    • splitgraph.com
    • data.virginia.gov
    Updated Mar 30, 2023
    Cite
    NCHS/DHIS (2023). NHIS Adult 3-Year Summary Health Statistics [Dataset]. https://www.splitgraph.com/cdc-gov/nhis-adult-3year-summary-health-statistics-krhz-spsc
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Mar 30, 2023
    Dataset authored and provided by
    NCHS/DHIS
    Description

    Interactive Summary Health Statistics for Adults, by Detailed Race and Ethnicity provide estimates as three-year averages of selected health topics for adults aged 18 years and over based on final data from the National Health Interview Survey.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  17. 2021 Final Assisted Reproductive Technology (ART) Summary

    • splitgraph.com
    • data.virginia.gov
    Updated Sep 11, 2024
    Cite
    Centers for Disease Control and Prevention National Center for Chronic Disease Prevention and Health Promotion Division of Reproductive Health (DRH) (2024). 2021 Final Assisted Reproductive Technology (ART) Summary [Dataset]. https://www.splitgraph.com/cdc-gov/2021-final-assisted-reproductive-technology-art-9tjt-seye
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    Centers for Disease Control and Prevention National Center for Chronic Disease Prevention and Health Promotion Division of Reproductive Health (DRH)
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Data were updated on September 11, 2024.

    ART data are made available as part of the National ART Surveillance System (NASS) that collects success rates, services, profiles and annual summary data from fertility clinics across the U.S. There are four datasets available: ART Services and Profiles, ART Patient and Cycle Characteristics, ART Success Rates, and ART Summary. All four datasets may be linked by “ClinicID.” ClinicID is a unique identifier for each clinic that reported cycles. The Summary dataset provides a full snapshot of clinic services and profile, patient characteristics, and ART success rates. It is worth noting that patient medical characteristics, such as age, diagnosis, and ovarian reserve, affect ART treatment’s success. Comparison of success rates across clinics may not be meaningful because of differences in patient populations and ART treatment methods. The success rates displayed in this dataset do not reflect any one patient’s chance of success. Patients should consult with a doctor to understand their chance of success based on their own characteristics.
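
    A minimal sketch of linking two of the four datasets on ClinicID; the table names and the non-key columns are assumptions:

        -- join clinic services/profile data to summary data
        -- (art_summary and art_services_profiles are assumed table names)
        SELECT s.ClinicID,
               p.clinic_state,   -- assumed column
               s.total_cycles    -- assumed column
        FROM art_summary s
        JOIN art_services_profiles p ON p.ClinicID = s.ClinicID;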

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  18. SQL In-Memory Database Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 15, 2025
    Cite
    Archive Market Research (2025). SQL In-Memory Database Report [Dataset]. https://www.archivemarketresearch.com/reports/sql-in-memory-database-28161
    Available download formats: pdf, ppt, doc
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The SQL In-Memory Database market is projected to witness significant growth in the coming years, driven by the increasing need for real-time data processing and analytics. The rise of big data and the Internet of Things (IoT) has led to an explosion of data, making it essential for businesses to be able to quickly and efficiently process and analyze data in order to gain actionable insights. SQL In-Memory Databases, which store data in memory rather than on disk, offer superior performance and speed, making them ideal for handling large and complex datasets in real time.

    The growing adoption of cloud computing is another factor contributing to the growth of the SQL In-Memory Database market. Cloud-based SQL In-Memory Databases offer a number of advantages, including scalability, flexibility, and cost-effectiveness. They allow businesses to easily scale their database up or down as needed, and they eliminate the need for expensive hardware and maintenance costs. As a result, cloud-based SQL In-Memory Databases are becoming increasingly popular with businesses of all sizes.

  19. Injury/Illness Summary - Operational Data (Form 55)

    • splitgraph.com
    • data.transportation.gov
    Updated Oct 5, 2024
    Cite
    datahub-transportation-gov (2024). Injury/Illness Summary - Operational Data (Form 55) [Dataset]. https://www.splitgraph.com/datahub-transportation-gov/injuryillness-summary-operational-data-form-55-m8i6-zdsy/
    Available download formats: application/openapi+json, json, application/vnd.splitgraph.image
    Dataset updated
    Oct 5, 2024
    Authors
    datahub-transportation-gov
    Description

    This dataset is in a user-friendly human-readable format. To download the source dataset that contains raw data values, go here: https://data.transportation.gov/dataset/Form-55-Source-Table/unww-uhxd.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  20. Data Modeling Tool Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 9, 2025
    Cite
    Data Insights Market (2025). Data Modeling Tool Report [Dataset]. https://www.datainsightsmarket.com/reports/data-modeling-tool-1455486
    Available download formats: pdf, doc, ppt
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Modeling Tool market is experiencing robust expansion, projected to reach an estimated value of $3,500 million by 2025, with a Compound Annual Growth Rate (CAGR) of approximately 12% anticipated from 2025 to 2033. This significant growth is fueled by the escalating need for efficient data management and architectural design across all business sizes. Small and Medium-sized Enterprises (SMEs) are increasingly adopting these tools to streamline their database development and enhance data integrity, moving from manual processes to more sophisticated modeling. Large enterprises, on the other hand, leverage advanced data modeling capabilities for complex data warehousing, big data analytics, and ensuring compliance with evolving data governance regulations. The ongoing digital transformation initiatives worldwide, coupled with the growing volume and complexity of data, are primary drivers for this market. Furthermore, the increasing demand for cloud-based solutions, offering scalability, accessibility, and cost-effectiveness, is reshaping the deployment landscape, with cloud-based models showing a stronger trajectory compared to on-premises solutions.

    The market dynamics are further shaped by several key trends. The integration of Artificial Intelligence (AI) and Machine Learning (ML) into data modeling tools is emerging as a significant differentiator, enabling automated schema generation, anomaly detection, and predictive data quality analysis. This enhances user productivity and accuracy. Collaboration features are also gaining prominence, allowing distributed teams to work seamlessly on database designs. However, the market faces certain restraints, including the initial cost of sophisticated tools, the need for specialized expertise to utilize advanced features effectively, and potential resistance to change from organizations accustomed to legacy systems. The competitive landscape is characterized by a mix of established players like IBM, Oracle, and SAP, alongside innovative niche providers such as Vertabelo, SQL Database Modeler, and Archi, all vying for market share through continuous product development and strategic partnerships. The Asia Pacific region, driven by rapid economic growth and widespread digital adoption in countries like China and India, is expected to be a significant growth engine for the data modeling tool market.

    This comprehensive report delves into the dynamic global Data Modeling Tool market, offering an in-depth analysis from the Historical Period of 2019-2024 through to the Forecast Period of 2025-2033, with 2025 serving as the Base Year and Estimated Year. We project the market to reach substantial valuations, with an estimated market size of $5.2 billion in 2025, and forecast a Compound Annual Growth Rate (CAGR) of 12.5%, pushing the market value to an impressive $13.8 billion by 2033. The study meticulously examines key market drivers, challenges, trends, and opportunities, providing actionable insights for stakeholders.
