72 datasets found
  1. SQL analysis using pizass data set

    • kaggle.com
    zip
    Updated Jul 13, 2024
    Cite
    Michael_Dsouza16 (2024). SQL analysis using pizass data set [Dataset]. https://www.kaggle.com/datasets/michaeldsouza16/sql-analysis-using-pizass-data-set
    Available download formats: zip (427330 bytes)
    Dataset updated
    Jul 13, 2024
    Authors
    Michael_Dsouza16
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is designed for SQL analysis exercises, providing comprehensive data on pizza sales, orders, and customer preferences. It includes details on order quantities, pizza types, and the composition of various pizzas. The dataset is ideal for practicing SQL queries, performing revenue analysis, and understanding customer behavior in the pizza industry.

    1. order_details.csv — Contains details of each pizza order. Columns:
    order_details_id: Unique identifier for the order detail.
    order_id: Identifier for the order.
    pizza_id: Identifier for the pizza type.
    quantity: Number of pizzas ordered.

    2. pizza_types.csv — Provides information on the different types of pizzas available. Columns:
    pizza_type_id: Unique identifier for the pizza type.
    name: Name of the pizza.
    category: Category of the pizza (e.g., Chicken, Vegetarian).
    ingredients: List of ingredients used in the pizza.

    3. Questions.txt — Contains various SQL questions for analyzing the dataset. Basic examples:
    Retrieve the total number of orders placed.
    Calculate the total revenue generated from pizza sales.
    Identify the highest-priced pizza.
    Identify the most common pizza size ordered.
    List the top 5 most ordered pizza types along with their quantities.
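
    As a worked sketch of the first two questions above: the pricing questions imply a pizzas table (pizza_id, pizza_type_id, size, price) linking order lines to prices, which is an assumption about the archive's contents rather than something described above.

        -- total number of orders placed
        SELECT COUNT(DISTINCT order_id) AS total_orders
        FROM order_details;

        -- total revenue generated from pizza sales
        -- (pizzas is an assumed table: pizza_id, pizza_type_id, size, price)
        SELECT ROUND(SUM(od.quantity * p.price), 2) AS total_revenue
        FROM order_details od
        JOIN pizzas p ON p.pizza_id = od.pizza_id;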

  2. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain-text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain-text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

        python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
        python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

        python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data-cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can easily be collated into a SQLite3 database, which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
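
    To give a flavor of querying the resulting database, here is a minimal sketch using the column layout documented above; the table name ngrams_isograms is an assumption, since the actual table names are defined by create-database.sql.

        -- ten most frequent second-order isograms that are also palindromes
        -- (ngrams_isograms is an assumed table name)
        SELECT word, length, count_per_million
        FROM ngrams_isograms
        WHERE isogramy = 2
          AND is_palindrome = 1
        ORDER BY count_per_million DESC
        LIMIT 10;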

  3. Current Population Survey (CPS)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file 2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
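
    as a minimal sketch of the 'create a new variable - one' step mentioned above (a constant column is handy for record counts in survey packages), assuming the person-level table in the sqlite database is named asec12 - the real table name is set by the download script:

        -- add a constant column named 'one' ('asec12' is an assumed table name)
        ALTER TABLE asec12 ADD COLUMN one INTEGER NOT NULL DEFAULT 1;

        -- every person record now counts once
        SELECT SUM(one) AS person_records FROM asec12;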

  4. eCommerce Transactions

    • kaggle.com
    zip
    Updated Jan 3, 2025
    Cite
    Chad Wambles (2025). eCommerce Transactions [Dataset]. https://www.kaggle.com/datasets/chadwambles/ecommerce-transactions
    Available download formats: zip (245430 bytes)
    Dataset updated
    Jan 3, 2025
    Authors
    Chad Wambles
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data set is perfect for practicing your analytical skills in Power BI, Tableau, or Excel, or for converting it to CSV to practice SQL.

    This use case mimics transactions for a fictional eCommerce website named EverMart Online. The three tables in this data set are logically connected by IDs.

    My Power BI Use Case Explanation - Using Microsoft Power BI, I made dynamic data visualizations for revenue reporting and customer behavior reporting.

    Revenue Reporting Visuals:
    • Data Card Visual that dynamically shows Total Products Listed, Total Unique Customers, Total Transactions, and Total Revenue by Total Sales, Product Sales, or Categorical Sales.
    • Line Graph Visual that shows Total Revenue by Month across the entire year. This graph also recalculates Total Revenue by Month for Total Sales by Product and Total Sales by Category if selected.
    • Bar Graph Visual showcasing Total Sales by Product.
    • Donut Chart Visual showcasing Total Sales by Category of Product.

    Customer Behavior Reporting Visuals:
    • Data Card Visual that dynamically shows Total Products Listed, Total Unique Customers, Total Transactions, and Total Revenue in total or by the continent selected on the map.
    • Interactive Map Visual showing key statistics for the selected continent. The key statistics are presented in the tooltip when you select a continent: Continent Name, Customer Total, Percentage of Products Sold, Percentage of Total Customers, Percentage of Total Transactions, and Percentage of Total Revenue.
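
    For the SQL-practice angle, here is a minimal sketch of a revenue-by-month query. The schema is hypothetical (the actual table and column names are whatever the three CSVs define): transactions(transaction_id, product_id, quantity, transaction_date) and products(product_id, product_name, category, price); strftime assumes the SQLite dialect.

        -- total revenue by month (SQLite dialect; all schema names are assumptions)
        SELECT strftime('%Y-%m', t.transaction_date) AS month,
               SUM(t.quantity * p.price) AS total_revenue
        FROM transactions t
        JOIN products p ON p.product_id = t.product_id
        GROUP BY month
        ORDER BY month;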

  5. Coronavirus Panoply.io for Database Warehousing and Post Analysis using...

    • data.mendeley.com
    Updated Feb 4, 2020
    Cite
    Pranav Pandya (2020). Coronavirus Panoply.io for Database Warehousing and Post Analysis using Sequal Language (SQL) [Dataset]. http://doi.org/10.17632/4gphfg5tgs.2
    Dataset updated
    Feb 4, 2020
    Authors
    Pranav Pandya
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It has never been easier to solve any database-related problem using SQL, and the following gives you an opportunity to understand how I was able to figure out some of the interrelationships between the tables using the Panoply.io tool.

    I was able to insert the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a Data Warehouse environment.

    The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

    Query 1

        SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed
        FROM "public"."coronavirus_updated"
        WHERE Recovered > (Deaths / 2) AND Deaths > 0

    Description: How can we estimate where the coronavirus has infiltrated but recovery amongst patients is effective? We can view those places by selecting regions where recoveries exceed half the death toll.

    Query 2

        SELECT country,
               SUM(confirmed) AS "Confirmed Count",
               SUM(Recovered) AS "Recovered Count",
               SUM(Deaths) AS "Death Toll"
        FROM "public"."coronavirus_updated"
        WHERE Recovered > (Deaths / 2) AND Confirmed > 0
        GROUP BY country

    Description: Per-country totals of confirmed, recovered, and death counts, restricted to countries with confirmed cases and effective recovery.

    Query 3

        SELECT country AS "Countries where Coronavirus has reached"
        FROM "public"."coronavirus_updated"
        WHERE confirmed > 0
        GROUP BY country

    Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing which countries have confirmed cases. Here is a list of those countries.

    Query 4

        SELECT country,
               SUM(suspected) AS "Suspected Cases under potential CoronaVirus outbreak"
        FROM "public"."coronavirus_updated"
        WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
        GROUP BY country
        ORDER BY SUM(suspected) DESC

    Description: The coronavirus is spreading at an alarming rate. It is important to know which countries are newly getting the virus, because if timely measures are taken there, casualties could be prevented. Here is a list of suspected cases in countries with no virus-related deaths.

    Query 5

        SELECT country,
               SUM(suspected) AS "Coronavirus uncontrolled spread count and human life loss",
               100 * SUM(suspected) / (SELECT SUM(suspected) FROM "public"."coronavirus_updated") AS "Global suspected Exposure of Coronavirus in percentage"
        FROM "public"."coronavirus_updated"
        WHERE suspected > 0 AND deaths = 0
        GROUP BY country
        ORDER BY SUM(suspected) DESC

    Description: The coronavirus is getting stronger in particular countries, but how do we measure that? By the percentage of the world's suspected cases found in each country that does not yet have any coronavirus-related deaths. The following is a list.

    Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India

  6. Statewide Commercial Baseline Study of New York Penetration and Saturation...

    • splitgraph.com
    • data.ny.gov
    Updated Jul 1, 2024
    Cite
    New York State Energy Research and Development Authority (NYSERDA) (2024). Statewide Commercial Baseline Study of New York Penetration and Saturation of Energy Using Equipment: 2019 [Dataset]. https://www.splitgraph.com/ny-gov/statewide-commercial-baseline-study-of-new-york-umaq-yp6d
    Available download formats: application/openapi+json, json, application/vnd.splitgraph.image
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    New York State Energy Research and Development Authority (https://www.nyserda.ny.gov/)
    Authors
    New York State Energy Research and Development Authority (NYSERDA)
    Area covered
    New York
    Description

    This dataset includes all Statewide Commercial Baseline Study summary statistics related to the estimation of population penetration and saturation estimates. These include summaries of the number of survey respondents asked each question, the number of survey respondents who provided a valid answer, the unweighted penetration, weighted penetration, and adjusted and weighted penetration. All supporting summary statistics are also provided. Penetration refers to the proportion of businesses that have one or more of a particular piece of equipment. Saturation is a number representing how many of a particular piece of equipment are present, on average, among all businesses. The overall objective of the Statewide Commercial Baseline research was to understand the existing commercial building stock in New York State and associated energy use, including the penetration and saturation of energy-consuming equipment (electric, natural gas, and other fuels). For more information, see the Final Report at https://www.nyserda.ny.gov/About/Publications/Building-Stock-and-Potential-Studies/Commercial-Statewide-Baseline-Study.

    NYSERDA offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, accelerate economic growth, and reduce reliance on fossil fuels. To learn more about NYSERDA’s programs, visit nyserda.ny.gov or follow us on X, Facebook, YouTube, or Instagram.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
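
    A minimal sketch of such a query, using Splitgraph's "namespace/repository"."table" addressing; the repository path follows the citation URL above, while the table name baseline_study is an assumption:

        -- count the rows in the dataset (the table name is an assumption)
        SELECT COUNT(*) AS row_count
        FROM "ny-gov/statewide-commercial-baseline-study-of-new-york-umaq-yp6d"."baseline_study";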

    See the Splitgraph documentation for more information.

  7. City-Level Descriptive Statistics for GHG Inventory

    • splitgraph.com
    • data.kcmo.org
    Updated Dec 28, 2023
    Cite
    Jerry Shechter (2023). City-Level Descriptive Statistics for GHG Inventory [Dataset]. https://www.splitgraph.com/kcmo/citylevel-descriptive-statistics-for-ghg-inventory-u9uw-758m
    Available download formats: json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Dec 28, 2023
    Dataset authored and provided by
    Jerry Shechter
    Description

    This data set contains community statistics that were used to calculate greenhouse gas (GHG) emissions for the purposes of the 2013 GHG inventory.

    Data sources include the US Census Bureau, Mid-America Regional Council (MARC), Jackson County Assessor's Office, KCP&L electric company, Missouri Gas/Laclede gas company, the Federal Highway Administration Office of Highway Policy Information Highway Statistics Series, Climate Action and Climate Protection Software notes, Kansas City Area Transit Authority (KCATA), the EPA large-emitter website (http://ghgdata.epa.gov), and the City of Kansas City Public Works and Water Services Departments.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  8. SRP All COMPASS GW Site Summary in New Jersey

    • share-open-data-njtpa.hub.arcgis.com
    • njogis-newjersey.opendata.arcgis.com
    Updated Jul 8, 2025
    Cite
    NJDEP Bureau of GIS (2025). SRP All COMPASS GW Site Summary in New Jersey [Dataset]. https://share-open-data-njtpa.hub.arcgis.com/datasets/njdep::srp-all-compass-gw-site-summary-in-new-jersey
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    NJDEP Bureau of GIS
    Description

    This GIS layer is based on a SQL query of the groundwater HAZSITE data that resides in COMPASS for each active Site Remediation case. Once the raw groundwater HAZSITE data is extracted from COMPASS, it is summarized such that a maximum concentration for each contaminant is derived for the year preceding the last sampling event (samp_last_max_conc), and a maximum concentration is also generated across all sampling events (all_max_conc). Each active Site Remediation case is included in the GIS layer. For the HAZSITE data, there are a number of considerations that need to be taken into account when using this GIS layer for decision-making purposes:
    • Not all SRP cases have provided HAZSITE data to the Department, or HAZSITE data that has been provided to the Department may be incomplete;
    • Additional sampling may have been conducted since the last round of HAZSITE data was submitted that has not yet been provided, as HAZSITE data is only required with key document submittals;
    • HAZSITE data that was submitted may not have been provided in the correct format and therefore could not be uploaded into the COMPASS data repository, and would therefore not be returned via the COMPASS SQL query.
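
    A minimal sketch of the kind of summary described above; the table and column names are assumptions (the real COMPASS schema is not published here), and the date arithmetic assumes the SQLite dialect:

        -- assumed table: hazsite_results(site_id, contaminant, sample_date, concentration)

        -- all-time maximum concentration per site and contaminant (cf. all_max_conc)
        SELECT site_id, contaminant, MAX(concentration) AS all_max_conc
        FROM hazsite_results
        GROUP BY site_id, contaminant;

        -- maximum within the year preceding each site's last sampling event (cf. samp_last_max_conc)
        SELECT r.site_id, r.contaminant, MAX(r.concentration) AS samp_last_max_conc
        FROM hazsite_results r
        JOIN (SELECT site_id, MAX(sample_date) AS last_date
              FROM hazsite_results
              GROUP BY site_id) l ON l.site_id = r.site_id
        WHERE r.sample_date >= date(l.last_date, '-1 year')
        GROUP BY r.site_id, r.contaminant;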

  9. NHIS Adult Summary Health Statistics

    • splitgraph.com
    • healthdata.gov
    Updated Jul 22, 2024
    Cite
    NCHS/DHIS (2024). NHIS Adult Summary Health Statistics [Dataset]. https://www.splitgraph.com/cdc-gov/nhis-adult-summary-health-statistics-25m4-6qqq
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    NCHS/DHIS
    Description

    Interactive Summary Health Statistics for Adults provide annual estimates of selected health topics for adults aged 18 years and over based on final data from the National Health Interview Survey.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  10. Independent dispute resolution summary data

    • splitgraph.com
    • data.texas.gov
    Updated Oct 15, 2024
    Cite
    texas-gov (2024). Independent dispute resolution summary data [Dataset]. https://www.splitgraph.com/texas-gov/independent-dispute-resolution-summary-data-bn27-65ad
    Available download formats: application/vnd.splitgraph.image, application/openapi+json, json
    Dataset updated
    Oct 15, 2024
    Authors
    texas-gov
    Description

    The Texas Department of Insurance administers Independent Dispute Resolution (IDR), a mediation and arbitration process for certain health care billing disputes between out-of-network providers and health plans. Mediation is used for billing disputes between out-of-network facilities and health plans. Arbitration is used for billing disputes between out-of-network health care providers (not facilities) and health plans. Medical services or supplies received on or after January 1, 2020 may be eligible for IDR. To learn more, go to the TDI webpage, Mediation and arbitration of medical bills.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  11. (Sunset)📒 Meta Kaggle ported to MS SQL SERVER

    • kaggle.com
    zip
    Updated Mar 20, 2024
    Cite
    BwandoWando (2024). (Sunset)📒 Meta Kaggle ported to MS SQL SERVER [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-ported-to-sql-server-2022-database
    Available download formats: zip (8635902534 bytes)
    Dataset updated
    Mar 20, 2024
    Authors
    BwandoWando
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. Also, I tend to write queries faster when using SQL Server Management Studio, like 100x faster. So, I ported Kaggle's Meta Kaggle dataset into MS SQL Server 2022 database format, created a backup file, then uploaded it here.

    • MSSQL VERSION: SQL Server 2022
    • Collation: SQL_Latin1_General_CP1_CI_AS
    • Recovery model: simple

    Requirements

    • Download and install the SQL SERVER 2022 Developer edition here
    • Download the backup file
    • Restore the backup file into your local instance. If you haven't done this before, it's easy and straightforward. Here is a guide.
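
    A minimal T-SQL sketch of the restore; the backup path and the logical file names are assumptions, so list the real ones with RESTORE FILELISTONLY first:

        -- inspect the logical file names inside the backup
        RESTORE FILELISTONLY FROM DISK = N'C:\Backups\MetaKaggle.bak';

        -- restore, moving data and log files to local directories
        -- (logical names 'MetaKaggle' and 'MetaKaggle_log' are assumptions)
        RESTORE DATABASE MetaKaggle
        FROM DISK = N'C:\Backups\MetaKaggle.bak'
        WITH MOVE N'MetaKaggle'     TO N'C:\Data\MetaKaggle.mdf',
             MOVE N'MetaKaggle_log' TO N'C:\Data\MetaKaggle_log.ldf',
             RECOVERY;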

    (QUOTED FROM THE ORIGINAL DATASET)

    Meta Kaggle

    Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.



  12. Steam Dataset 2025: Multi-Modal Gaming Analytics

    • kaggle.com
    zip
    Updated Oct 7, 2025
    Cite
    CrainBramp (2025). Steam Dataset 2025: Multi-Modal Gaming Analytics [Dataset]. https://www.kaggle.com/datasets/crainbramp/steam-dataset-2025-multi-modal-gaming-analytics
    Available download formats: zip (12478964226 bytes)
    Dataset updated
    Oct 7, 2025
    Authors
    CrainBramp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Steam Dataset 2025: Multi-Modal Gaming Analytics Platform

    The first multi-modal Steam dataset with semantic search capabilities. 239,664 applications collected from official Steam Web APIs with PostgreSQL database architecture, vector embeddings for content discovery, and comprehensive review analytics.

    Made by a lifelong gamer for the gamer in all of us. Enjoy!🎮

    GitHub Repository https://github.com/vintagedon/steam-dataset-2025

    [Figure: 1024-dimensional game embeddings projected to 2D via UMAP reveal natural genre clustering in semantic space]

    What Makes This Different

    Unlike traditional flat-file Steam datasets, this is built as an analytically-native database optimized for advanced data science workflows:

    ☑️ Semantic Search Ready - 1024-dimensional BGE-M3 embeddings enable content-based game discovery beyond keyword matching (see the query sketch after this list)

    ☑️ Multi-Modal Architecture - PostgreSQL + JSONB + pgvector in unified database structure

    ☑️ Production Scale - 239K applications vs typical 6K-27K in existing datasets

    ☑️ Complete Review Corpus - 1,048,148 user reviews with sentiment and metadata

    ☑️ 28-Year Coverage - Platform evolution from 1997-2025

    ☑️ Publisher Networks - Developer and publisher relationship data for graph analysis

    ☑️ Complete Methodology & Infrastructure - Full work logs document every technical decision and challenge encountered, while my API collection scripts, database schemas, and processing pipelines enable you to update the dataset, fork it for customized analysis, learn from real-world data engineering workflows, or critique and improve the methodology
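
    As a minimal sketch of the semantic-search idea flagged above, the query below assumes a pgvector-enabled PostgreSQL restore of the dump with a hypothetical table games(appid bigint, name text, embedding vector(1024)); the real table and column names are set by the dataset's schema documentation.

        -- ten games most similar to a chosen title, by cosine distance (pgvector's <=> operator)
        -- 'games' and its columns are assumed names
        SELECT g2.appid, g2.name,
               g1.embedding <=> g2.embedding AS cosine_distance
        FROM games g1
        JOIN games g2 ON g2.appid <> g1.appid
        WHERE g1.name = 'Hollow Knight'   -- any query title works here
        ORDER BY g1.embedding <=> g2.embedding
        LIMIT 10;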

    [Figure: Market segmentation and pricing strategy analysis across top 10 genres]

    What's Included

    Core Data (CSV Exports):
    • 239,664 Steam applications with complete metadata
    • 1,048,148 user reviews with scores and statistics
    • 13 normalized relational tables for pandas/SQL workflows
    • Genre classifications, pricing history, platform support
    • Hardware requirements (min/recommended specs)
    • Developer and publisher portfolios

    Advanced Features (PostgreSQL):
    • Full database dump with optimized indexes
    • JSONB storage preserving complete API responses
    • Materialized columns for sub-second query performance
    • Vector embeddings table (pgvector-ready)

    Documentation:
    • Complete data dictionary with field specifications
    • Database schema documentation
    • Collection methodology and validation reports

    Example Analysis: Published Notebooks (v1.0)

    Three comprehensive analysis notebooks demonstrate dataset capabilities. All notebooks render directly on GitHub with full visualizations and output:

    📊 Platform Evolution & Market Landscape

    View on GitHub | PDF Export
    28 years of Steam's growth, genre evolution, and pricing strategies.

    🔍 Semantic Game Discovery

    View on GitHub | PDF Export
    Content-based recommendations using vector embeddings across genre boundaries.

    🎯 The Semantic Fingerprint

    View on GitHub | PDF Export
    Genre prediction from game descriptions - demonstrates text analysis capabilities.

    Notebooks render with full output on GitHub. Kaggle-native versions planned for v1.1 release. CSV data exports included in dataset for immediate analysis.

    [Figure: Steam platfor...]

  13. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research. if you apply for an interviewer job with them, i hope you like werther's original.

    figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle.

    but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

    the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    1992 - 2010 download HRS microdata.R
    • loop through every year and every file, download, then unzip everything in one big party

    import longitudinal RAND contributed files.R
    • create a SQLite database (.db) on the local disk
    • load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

    longitudinal RAND - analysis examples.R
    • connect to the sql database created by the 'import longitudinal RAND contributed files' program
    • create two database-backed complex sample survey objects, using a taylor-series linearization design
    • perform a mountain of analysis examples with wave weights from two different points in the panel

    import example HRS file.R
    • load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
    • parse through the IF block at the bottom of the sas importation script, blank out a number of variables
    • save the file as an R data file (.rda) for fast loading later

    replicate 2002 regression.R
    • connect to the sql database created by the 'import longitudinal RAND contributed files' program
    • create a database-backed complex sample survey object, using a taylor-series linearization design
    • exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts

    for more detail about the health and retirement study (hrs), visit:
    • michigan's hrs homepage
    • rand's hrs homepage
    • the hrs wikipedia page
    • a running list of publications using hrs

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

    confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  14. Census Demographics

    • splitgraph.com
    • data.brla.gov
    Updated Dec 15, 2021
    Cite
    Information Services (2021). Census Demographics [Dataset]. https://www.splitgraph.com/brla-gov/census-demographics-xsrb-mxqt/
    Available download formats: json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Dec 15, 2021
    Dataset authored and provided by
    Information Services
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Summary statistics from the 2000 and 2010 United States Census including population, demographics, education, and housing information for each block group in East Baton Rouge Parish, Louisiana.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  15. Statewide Commercial Baseline Study of New York Means of Energy Using...

    • splitgraph.com
    • data.ny.gov
    Updated Jul 1, 2024
    Cite
    New York State Energy Research and Development Authority (NYSERDA) (2024). Statewide Commercial Baseline Study of New York Means of Energy Using Equipment: 2019 [Dataset]. https://www.splitgraph.com/ny-gov/statewide-commercial-baseline-study-of-new-york-ttu3-cutd
    Available download formats: application/vnd.splitgraph.image, application/openapi+json, json
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    New York State Energy Research and Development Authority (https://www.nyserda.ny.gov/)
    Authors
    New York State Energy Research and Development Authority (NYSERDA)
    Area covered
    New York
    Description

    The overall objective of the Statewide Commercial Baseline research was to understand the existing commercial building stock in New York State and associated energy use, including the means of energy-using equipment characteristics. This dataset provides all characteristics that are presented as averages, such as the average square footage of businesses or the average cooling capacity of split systems. All supporting summary statistics are also provided. For more information, see the Final Report at https://www.nyserda.ny.gov/About/Publications/Building-Stock-and-Potential-Studies/Commercial-Statewide-Baseline-Study.

    NYSERDA offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, accelerate economic growth, and reduce reliance on fossil fuels. To learn more about NYSERDA’s programs, visit nyserda.ny.gov or follow us on X, Facebook, YouTube, or Instagram.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  16. NHIS Adult 3-Year Summary Health Statistics

    • splitgraph.com
    • data.virginia.gov
    Updated Mar 30, 2023
    Cite
    NCHS/DHIS (2023). NHIS Adult 3-Year Summary Health Statistics [Dataset]. https://www.splitgraph.com/cdc-gov/nhis-adult-3year-summary-health-statistics-krhz-spsc
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Mar 30, 2023
    Dataset authored and provided by
    NCHS/DHIS
    Description

    Interactive Summary Health Statistics for Adults, by Detailed Race and Ethnicity provide estimates as three-year averages of selected health topics for adults aged 18 years and over based on final data from the National Health Interview Survey.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  17. 2021 Final Assisted Reproductive Technology (ART) Summary

    • splitgraph.com
    • data.virginia.gov
    Updated Sep 11, 2024
    Cite
    Centers for Disease Control and Prevention National Center for Chronic Disease Prevention and Health Promotion Division of Reproductive Health (DRH) (2024). 2021 Final Assisted Reproductive Technology (ART) Summary [Dataset]. https://www.splitgraph.com/cdc-gov/2021-final-assisted-reproductive-technology-art-9tjt-seye
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    Centers for Disease Control and Prevention National Center for Chronic Disease Prevention and Health Promotion Division of Reproductive Health (DRH)
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Data were updated on September 11, 2024.

    ART data are made available as part of the National ART Surveillance System (NASS) that collects success rates, services, profiles and annual summary data from fertility clinics across the U.S. There are four datasets available: ART Services and Profiles, ART Patient and Cycle Characteristics, ART Success Rates, and ART Summary. All four datasets may be linked by “ClinicID.” ClinicID is a unique identifier for each clinic that reported cycles. The Summary dataset provides a full snapshot of clinic services and profile, patient characteristics, and ART success rates. It is worth noting that patient medical characteristics, such as age, diagnosis, and ovarian reserve, affect ART treatment’s success. Comparison of success rates across clinics may not be meaningful because of differences in patient populations and ART treatment methods. The success rates displayed in this dataset do not reflect any one patient’s chance of success. Patients should consult with a doctor to understand their chance of success based on their own characteristics.
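
    A minimal sketch of linking two of the four datasets on ClinicID; the table names and the non-key columns are assumptions:

        -- join clinic services/profile data to summary data
        -- (art_summary and art_services_profiles are assumed table names)
        SELECT s.ClinicID,
               p.clinic_state,   -- assumed column
               s.total_cycles    -- assumed column
        FROM art_summary s
        JOIN art_services_profiles p ON p.ClinicID = s.ClinicID;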

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  18. SQL In-Memory Database Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 15, 2025
    Cite
    Archive Market Research (2025). SQL In-Memory Database Report [Dataset]. https://www.archivemarketresearch.com/reports/sql-in-memory-database-28161
    Available download formats: pdf, ppt, doc
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The SQL In-Memory Database market is projected to witness significant growth in the coming years, driven by the increasing need for real-time data processing and analytics. The rise of big data and the Internet of Things (IoT) has led to an explosion of data, making it essential for businesses to be able to quickly and efficiently process and analyze data in order to gain actionable insights. SQL In-Memory Databases, which store data in memory rather than on disk, offer superior performance and speed, making them ideal for handling large and complex datasets in real time.

    The growing adoption of cloud computing is another factor contributing to the growth of the SQL In-Memory Database market. Cloud-based SQL In-Memory Databases offer a number of advantages, including scalability, flexibility, and cost-effectiveness. They allow businesses to easily scale their database up or down as needed, and they eliminate the need for expensive hardware and maintenance costs. As a result, cloud-based SQL In-Memory Databases are becoming increasingly popular with businesses of all sizes.

  19. Injury/Illness Summary - Operational Data (Form 55)

    • splitgraph.com
    • data.transportation.gov
    Updated Oct 5, 2024
    Cite
    datahub-transportation-gov (2024). Injury/Illness Summary - Operational Data (Form 55) [Dataset]. https://www.splitgraph.com/datahub-transportation-gov/injuryillness-summary-operational-data-form-55-m8i6-zdsy/
    Available download formats: application/openapi+json, json, application/vnd.splitgraph.image
    Dataset updated
    Oct 5, 2024
    Authors
    datahub-transportation-gov
    Description

    This dataset is in a user-friendly human-readable format. To download the source dataset that contains raw data values, go here: https://data.transportation.gov/dataset/Form-55-Source-Table/unww-uhxd.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  20. Data Modeling Tool Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 9, 2025
    Cite
    Data Insights Market (2025). Data Modeling Tool Report [Dataset]. https://www.datainsightsmarket.com/reports/data-modeling-tool-1455486
    Available download formats: pdf, doc, ppt
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Modeling Tool market is experiencing robust expansion, projected to reach an estimated value of $3,500 million by 2025, with a Compound Annual Growth Rate (CAGR) of approximately 12% anticipated from 2025 to 2033. This significant growth is fueled by the escalating need for efficient data management and architectural design across all business sizes. Small and Medium-sized Enterprises (SMEs) are increasingly adopting these tools to streamline their database development and enhance data integrity, moving from manual processes to more sophisticated modeling. Large enterprises, on the other hand, leverage advanced data modeling capabilities for complex data warehousing, big data analytics, and ensuring compliance with evolving data governance regulations. The ongoing digital transformation initiatives worldwide, coupled with the growing volume and complexity of data, are primary drivers for this market. Furthermore, the increasing demand for cloud-based solutions, offering scalability, accessibility, and cost-effectiveness, is reshaping the deployment landscape, with cloud-based models showing a stronger trajectory compared to on-premises solutions.

    The market dynamics are further shaped by several key trends. The integration of Artificial Intelligence (AI) and Machine Learning (ML) into data modeling tools is emerging as a significant differentiator, enabling automated schema generation, anomaly detection, and predictive data quality analysis. This enhances user productivity and accuracy. Collaboration features are also gaining prominence, allowing distributed teams to work seamlessly on database designs. However, the market faces certain restraints, including the initial cost of sophisticated tools, the need for specialized expertise to utilize advanced features effectively, and potential resistance to change from organizations accustomed to legacy systems. The competitive landscape is characterized by a mix of established players like IBM, Oracle, and SAP, alongside innovative niche providers such as Vertabelo, SQL Database Modeler, and Archi, all vying for market share through continuous product development and strategic partnerships. The Asia Pacific region, driven by rapid economic growth and widespread digital adoption in countries like China and India, is expected to be a significant growth engine for the data modeling tool market.

    This comprehensive report delves into the dynamic global Data Modeling Tool market, offering an in-depth analysis from the Historical Period of 2019-2024 through to the Forecast Period of 2025-2033, with 2025 serving as the Base Year and Estimated Year. We project the market to reach substantial valuations, with an estimated market size of $5.2 billion in 2025, and forecast a Compound Annual Growth Rate (CAGR) of 12.5%, pushing the market value to an impressive $13.8 billion by 2033. The study meticulously examines key market drivers, challenges, trends, and opportunities, providing actionable insights for stakeholders.
