6 datasets found
  1. CC20 Artifact - Automatic Fusion

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Alexander Krolik (2020). CC20 Artifact - Automatic Fusion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3608382
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Alexander Krolik
    Clark Verbrugge
    Laurie Hendren
    Bettina Kemme
    Hanfeng Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Getting started

    The title of our paper submitted to CC20 is

    Improving Database Query Performance with Automatic Fusion

    This repository demonstrates the reproducibility of the experiments in our paper. We provide the scripts and the original data used in the experiments. Two main systems are involved: HorsePower and the RDBMS MonetDB. We give step-by-step instructions for configuring and deploying both systems.

    On this page, you will see:

    how to run the experiments (Section 2); and

    the results used in the paper (Section 3).

    2. Experiments

    All experiments were run on a server called sable-intel equipped with:

    Ubuntu 16.04.6 LTS (64-bit)

    4 Intel Xeon E7-4850 CPUs at 2.00 GHz

    40 cores / 80 threads in total

    128 GB RAM

    Docker setup

    Download the docker image: cc20-docker.tar (about 13 GB)

    docker load < cc20-docker.tar

    Generate a named container (then exit)

    docker run --hostname sableintel -it --name=container-cc20 wukefe/cc20-docker exit

    Then, you can run the container

    docker start -ai container-cc20

    Open a new terminal to access the container (optional)

    docker exec -it container-cc20 /bin/bash

    Introduction to MonetDB

    Work directory for MonetDB

    /home/hanfeng/cc20/monetdb

    Start MonetDB (use all available threads)

    ./run.sh start

    Log in to MonetDB using its client tool, mclient

    mclient -d tpch1

    ... MonetDB version v11.33.3 (Apr2019)

    sql> SELECT 'Hello world';
    +-------------+
    | L2          |
    +=============+
    | Hello world |
    +-------------+
    1 tuple

    Show the list of tables in the current database

    sql> \d
    TABLE sys.customer
    TABLE sys.lineitem
    TABLE sys.nation
    TABLE sys.orders
    TABLE sys.part
    TABLE sys.partsupp
    TABLE sys.region
    TABLE sys.supplier

    Leave the session

    sql> \q

    Stop MonetDB before continuing with the experiments

    ./run.sh stop

    Reference: the MonetDB documentation on installation and on its server and client programs.

    Run MonetDB with TPC-H queries

    MonetDB: server mode

    Invoke MonetDB with a specific number of threads (e.g. 1)

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=1

    Open a new terminal

    docker exec -it container-cc20 /bin/bash
    cd cc20/monetdb

    Note: Type \q to exit the server mode.

    Run with a specific number of threads (Two terminals required)

    1 thread

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=1

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_1.log"

    2 threads

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=2

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_2.log"

    4 threads

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=4

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_4.log"

    8 threads

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=8

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_8.log"

    16 threads

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=16

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_16.log"

    32 threads

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=32

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_32.log"

    64 threads

    terminal 1

    mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=64

    terminal 2

    (time ./runtest | mclient -d tpch1) &> "log/log_thread_64.log"

    Post data processing - MonetDB

    Fetch average execution time (ms)

    grep -A 3 avg_query log/log_thread_1.log | python cut.py

    699.834133333  // q1
    85.9178666667  // q4
    65.0172        // q6
    101.730666667  // q12
    58.212         // q14
    60.1138666667  // q16
    248.926466667  // q19
    77.6482        // q22

    grep -A 3 avg_query log/log_thread_2.log | python cut.py
    grep -A 3 avg_query log/log_thread_4.log | python cut.py
    grep -A 3 avg_query log/log_thread_8.log | python cut.py
    grep -A 3 avg_query log/log_thread_16.log | python cut.py
    grep -A 3 avg_query log/log_thread_32.log | python cut.py
    grep -A 3 avg_query log/log_thread_64.log | python cut.py

    Note: The above numbers can be copied to an Excel file for further analysis before plotting figures. Details can be found in Section 3.
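    As an alternative to Excel, the per-query averages can be post-processed in a few lines of Python. The sketch below is ours, not part of the artifact; it assumes the `value // qN` line format produced by cut.py as shown above.

    ```python
    # Hypothetical post-processing helpers (not part of the artifact):
    # parse the "value // qN" lines produced by cut.py and compare two runs.

    def parse_avgs(text):
        """Parse lines like '699.834 // q1' into {'q1': 699.834}."""
        out = {}
        for line in text.strip().splitlines():
            value, sep, label = line.partition("//")
            if sep:
                out[label.strip()] = float(value)
        return out

    def speedup(base, other):
        """Per-query speedup of `other` over `base` (base_ms / other_ms)."""
        return {q: base[q] / other[q] for q in base if q in other}

    if __name__ == "__main__":
        one_thread = parse_avgs("699.834 // q1\n85.918 // q4")
        two_threads = parse_avgs("349.917 // q1\n42.959 // q4")
        print(speedup(one_thread, two_threads))
    ```

    Feeding it the outputs of two thread counts gives the per-query scaling directly, without a spreadsheet round-trip.
    
    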

    Run with HorseIR

    The HorsePower project can be found on GitHub. In the docker image, it has been placed in /home/hanfeng/cc20/horse.

    https://github.com/Sable/HorsePower

    Execution time

    We then run each query 15 times to get the average execution time (ms).

    (cd /home/hanfeng/cc20/horse/ && time ./run_all.sh)

    The script run_all.sh runs over three versions of generated C code based on different levels of optimizations.

    • naive : no optimization
    • opt1 : with optimizations
    • opt2 : with automatic fusion

    For each version, the script first compiles the C code and then runs the generated binary with different numbers of threads (1, 2, 4, 8, 16, 32, and 64). Each run computes a query 15 times and returns the average.

    All output is saved into log files; for example, log/naive/log_q6.txt contains the results of query 6 in the naive version for all thread counts.

    Log file structures

    log/naive/*.txt
    log/opt1/*.txt
    log/opt2/*.txt

    Fetch a brief summary of execution time from a log file

    cat log/naive/log_q6.txt | grep -E 'Run with 15 times'

    q06>> Run with 15 times, last 15 average (ms): 266.638 | 278.999 266.134 266.417 <12 more>  # 1 thread
    q06>> Run with 15 times, last 15 average (ms): 138.556 | 144.474 137.837 137.579 <12 more>  # 2 threads
    q06>> Run with 15 times, last 15 average (ms): 71.8851 | 75.339 72.102 72.341 <12 more>     # 4 threads
    q06>> Run with 15 times, last 15 average (ms): 73.111 | 75.867 72.53 72.936 <12 more>       # 8 threads
    q06>> Run with 15 times, last 15 average (ms): 56.1003 | 59.263 56.057 56.039 <12 more>     # 16 threads
    q06>> Run with 15 times, last 15 average (ms): 56.8858 | 59.466 56.651 57.109 <12 more>     # 32 threads
    q06>> Run with 15 times, last 15 average (ms): 53.4254 | 55.884 54.457 52.878 <12 more>     # 64 threads
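    These summary lines can also be scraped programmatically. The parser below is an illustrative sketch of ours (not the artifact's gen_for_copy.py); it assumes the line format shown above and that the seven lines appear in increasing thread order.

    ```python
    import re

    # Illustrative parser (ours, not part of the artifact) for lines like:
    #   q06>> Run with 15 times, last 15 average (ms): 266.638 | 278.999 ...
    THREADS = [1, 2, 4, 8, 16, 32, 64]

    def parse_summary(lines):
        """Map thread count -> average ms, taking lines in increasing thread order."""
        avgs = []
        for line in lines:
            m = re.search(r"average \(ms\):\s*([\d.]+)\s*\|", line)
            if m:
                avgs.append(float(m.group(1)))
        return dict(zip(THREADS, avgs))

    # e.g. parse_summary(open("log/naive/log_q6.txt"))
    ```
    
    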

    It becomes tedious to extract this information by hand for all queries across the three versions, so we provide a simple solution.

    ./run.sh fetch log | python gen_for_copy.py

    The output has the following format

    // query id

    | naive | opt1 | opt2 |

    | ... | ... | ... |  # 1 thread
    | ... | ... | ... |  # 2 threads
    ...
    | ... | ... | ... |  # 64 threads

    We copy the generated numbers into the Excel file described in Section 3, where we compare the performance of MonetDB against the different versions of the generated C code.

    Compilation time

    Work directory

    /home/hanfeng/cc20/horse/codegen

    Fetch compilation time for different kinds of C code

    ./run.sh compile naive &> log_cc20_compile_naive.txt
    ./run.sh compile opt1 &> log_cc20_compile_opt1.txt
    ./run.sh compile opt2 &> log_cc20_compile_opt2.txt

    Let's look into the result of query 1 in the log file log_cc20_compile_naive.txt.

    Time variable                usr         sys         wall        GGC
    phase setup             : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)  1266 kB ( 18%)
    phase parsing           : 0.07 (54%)  0.07 (88%)  0.14 (64%)  3897 kB ( 55%)
    phase opt and generate  : 0.06 (46%)  0.01 (12%)  0.07 (32%)  1899 kB ( 27%)
    dump files              : 0.00 ( 0%)  0.00 ( 0%)  0.02 ( 9%)     0 kB (  0%)
    df reg dead/unused notes: 0.01 ( 8%)  0.00 ( 0%)  0.00 ( 0%)    31 kB (  0%)
    register information    : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)     0 kB (  0%)
    preprocessing           : 0.03 (23%)  0.02 (25%)  0.08 (36%)  1468 kB ( 21%)
    lexical analysis        : 0.00 ( 0%)  0.03 (38%)  0.05 (23%)     0 kB (  0%)
    parser (global)         : 0.04 (31%)  0.02 (25%)  0.01 ( 5%)  2039 kB ( 29%)
    tree SSA other          : 0.00 ( 0%)  0.01 (12%)  0.00 ( 0%)     3 kB (  0%)
    integrated RA           : 0.01 ( 8%)  0.00 ( 0%)  0.01 ( 5%)   726 kB ( 10%)
    thread pro- & epilogue  : 0.02 (15%)  0.00 ( 0%)  0.00 ( 0%)    41 kB (  1%)
    shorten branches        : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)     0 kB (  0%)
    final                   : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)    56 kB (  1%)
    initialize rtl          : 0.01 ( 8%)  0.00 ( 0%)  0.01 ( 5%)    12 kB (  0%)
    rest of compilation     : 0.01 ( 8%)  0.00 ( 0%)  0.00 ( 0%)    62 kB (  1%)
    TOTAL                   : 0.13        0.08        0.22        7072 kB

    The compilation time is broken down into many phases. We take the total wall time as the actual time spent on compilation; this query takes 0.22 seconds to compile. (Note that manual work is required to retrieve the compilation time.)
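    That manual step can be scripted. The helper below is our own sketch; it assumes the log layout matches the GCC time-report excerpt above, with the wall time as the third number on the TOTAL line.

    ```python
    import re

    # Extract the wall-clock TOTAL from a GCC -ftime-report style log.
    # Column order (usr, sys, wall) is assumed from the excerpt above.

    def total_wall_seconds(report):
        """Return the TOTAL wall time in seconds, or None if absent."""
        m = re.search(r"TOTAL\s*:\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)", report)
        return float(m.group(3)) if m else None

    # e.g. total_wall_seconds(open("log_cc20_compile_naive.txt").read())
    ```
    
    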

    3. Results

  2. Data from: Open-data release of aggregated Australian school-level...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, txt
    Updated Jan 24, 2020
    Cite
    Monteiro Lobato (2020). Open-data release of aggregated Australian school-level information. Edition 2016.1 [Dataset]. http://doi.org/10.5281/zenodo.46086
    Explore at:
    Available download formats: csv, bin, txt
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Monteiro Lobato
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The file set is a freely downloadable aggregation of information about Australian schools. The individual files represent a series of tables which, when considered together, form a relational database. The records cover the years 2008-2014 and include information on approximately 9500 primary and secondary school main-campuses and around 500 subcampuses. The records all relate to school-level data; no data about individuals is included. All the information has previously been published and is publicly available but it has not previously been released as a documented, useful aggregation. The information includes:
    (a) the names of schools
    (b) staffing levels, including full-time and part-time teaching and non-teaching staff
    (c) student enrolments, including the number of boys and girls
    (d) school financial information, including Commonwealth government, state government, and private funding
    (e) test data, potentially for school years 3, 5, 7 and 9, relating to an Australian national testing programme known by the trademark 'NAPLAN'

    Documentation of this Edition 2016.1 is incomplete but the organization of the data should be readily understandable to most people. If you are a researcher, the simplest way to study the data is to make use of the SQLite3 database called 'school-data-2016-1.db'. If you are unsure how to use an SQLite database, ask a guru.

    The database was constructed directly from the other included files by running the following command at a command-line prompt:
    sqlite3 school-data-2016-1.db < school-data-2016-1.sql
    Note that a few, non-consequential, errors will be reported if you run this command yourself. The reason for the errors is that the SQLite database is created by importing a series of '.csv' files. Each of the .csv files contains a header line with the names of the variables relevant to each column. That header information is useful for many statistical packages but it is not what SQLite expects, so it complains about the header. Despite the complaint, the database will be created correctly.
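    If you prefer to avoid the complaint altogether, the header can be consumed outside SQLite. The following is a minimal Python sketch of ours (not the dataset's build script; file and table names are illustrative, and the real schema lives in school-data-2016-1.sql) that imports one '|'-delimited file:

    ```python
    import csv
    import sqlite3

    # Sketch only: import one '|'-delimited .csv file, using its header line
    # for column names instead of letting SQLite trip over it.

    def import_bar_csv(db_path, csv_path, table):
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f, delimiter="|"))
        header, data = rows[0], rows[1:]
        cols = ", ".join('"%s"' % c for c in header)
        marks = ", ".join("?" * len(header))
        con = sqlite3.connect(db_path)
        con.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
        con.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), data)
        con.commit()
        con.close()

    # e.g. import_bar_csv("school-data.db", "students-headed-bar.csv", "students")
    ```
    
    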

    Briefly, the data are organized as follows.
    (a) The .csv files ('comma separated values') do not actually use a comma as the field delimiter. Instead, the vertical bar character '|' (ASCII octal 174, decimal 124, hex 7C) is used. If you read the .csv files using Microsoft Excel, Open Office, or Libre Office, you will need to set the field separator to '|'. Check your software documentation to understand how to do this.
    (b) Each school-related record is indexed by an identifier called 'ageid'. The ageid uniquely identifies each school and consequently serves as the appropriate variable for JOIN-ing records in different data files. For example, the first school-related record after the header line in file 'students-headed-bar.csv' shows the ageid of the school as 40000. The relevant school name can be found by looking in the file 'ageidtoname-headed-bar.csv' to discover that the ageid of 40000 corresponds to a school called 'Corpus Christi Catholic School'.
    (c) In addition to the variable 'ageid', each record is also identified by one or two 'year' variables. The most important purpose of a year identifier is to indicate the year that is relevant to the record. For example, if one turns again to file 'students-headed-bar.csv', one sees that the first seven school-related records after the header line all relate to the school Corpus Christi Catholic School with ageid of 40000. The variable that identifies the important differences between these seven records is 'studentsyear', which shows the year to which the student data refer. One can see, for example, that in 2008 there were a total of 410 students enrolled, of whom 185 were girls and 225 were boys (look at the variable names in the header line).
    (d) The variables relating to years are given different names in each of the different files ('studentsyear' in the file 'students-headed-bar.csv', 'financesummaryyear' in the file 'financesummary-headed-bar.csv'). Despite the different names, the year variables provide the second-level means for joining information across files. For example, if you wanted to relate the enrolments at a school in each year to its financial state, you might JOIN records using 'ageid' in the two files and, secondarily, match 'studentsyear' with 'financesummaryyear'.
    (e) The manipulation of the data is most readily done using the SQL language with the SQLite database, but it can also be done in a variety of statistical packages.
    (f) It is our intention for Edition 2016-2 to create large 'flat' files suitable for use by non-researchers who want to view the data with spreadsheet software. The disadvantage of such 'flat' files is that they contain vast amounts of redundant information and might not display the data in the form that the user most wants.
    (g) Geocoding of the schools is not available in this edition.
    (h) Some files, such as 'sector-headed-bar.csv', are not used in the creation of the database but are provided as a convenience for researchers who might wish to recode some of the data to remove redundancy.
    (i) A detailed example of a suitable SQLite query can be found in the file 'school-data-sqlite-example.sql'. The same query, used in the context of analyses done with the excellent, freely available R statistical package (http://www.r-project.org), can be seen in the file 'school-data-with-sqlite.R'.
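    The two-level join described above (first on ageid, then matching the per-file year variables) can be expressed directly in SQL. Below is a hedged sketch via Python's sqlite3; the table names are our guesses from the .csv file names, so verify them against the actual schema (or the bundled school-data-sqlite-example.sql) before relying on it.

    ```python
    import sqlite3

    # Hypothetical two-level join: same school (ageid) and same year
    # (studentsyear vs financesummaryyear). Table names assumed from the
    # .csv file names; verify against the real database schema.
    QUERY = """
        SELECT s.ageid, s.studentsyear AS year
        FROM students AS s
        JOIN financesummary AS f
          ON f.ageid = s.ageid
         AND f.financesummaryyear = s.studentsyear
    """

    def enrolment_vs_finance(con):
        """One row per school per year present in both files; add the
        enrolment and finance columns of interest to the SELECT list."""
        return con.execute(QUERY).fetchall()

    # e.g. enrolment_vs_finance(sqlite3.connect("school-data-2016-1.db"))
    ```
    
    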

  3. University of Cape Town Student Admissions Data 2006-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jul 28, 2020
    + more versions
    Cite
    UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/556
    Explore at:
    Dataset updated
    Jul 28, 2020
    Dataset authored and provided by
    UCT Student Administration
    Time period covered
    2006 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst the data were ill-suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data and the data are only made available to accredited researchers through DataFirst's Secure Data Service.

    The dataset was separated into the following data files:

    1. Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.
    2. Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.
    3. Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific, and the data allow the user to distinguish between, for example, higher grade accounting and standard grade accounting. They also allow the user to identify the educational authority issuing the qualification, e.g. Cambridge International Examinations (CIE) versus National Senior Certificate (NSC).
    4. Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.

    Analysis unit

    Applications, individuals

    Kind of data

    Administrative records [adm]

    Mode of data collection

    Other [oth]

    Cleaning operations

    The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services (ICTS). The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.

  4. LADOT Parking Meter Occupancy - Archive

    • splitgraph.com
    • data.lacity.org
    • +2more
    Updated Oct 2, 2024
    Cite
    LADOT Parking Meters Division - LA ExpressPark (2024). LADOT Parking Meter Occupancy - Archive [Dataset]. https://www.splitgraph.com/lacity/ladot-parking-meter-occupancy-archive-cj8s-ivry/
    Explore at:
    Available download formats: json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    LADOT Parking Meters Division - LA ExpressPark
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Monthly archive of all parking meter sensor activity over the previous 36 months (3 years). Updated monthly with data from 2 months prior (e.g. January data will be published in early March).

    For best-available current "live" status, see "LADOT Parking Meter Occupancy".

    For location and parking policy details, see "LADOT Metered Parking Inventory & Policies".

    Note: This dataset is geared towards database professionals and/or app developers. Each file is extremely large, over 300 MB at minimum; common applications like Microsoft Excel will not be able to open a file and show all the data.

    For best results, import into a database or use data access methods appropriate for processing large files.
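    One such approach, sketched here under placeholder file and table names, is to stream the CSV into SQLite in fixed-size batches so the 300 MB+ file is never held in memory at once:

    ```python
    import csv
    import sqlite3
    from itertools import islice

    # Sketch: batch-load a very large CSV into SQLite. All names are
    # placeholders, not actual files from this dataset.

    def load_in_batches(csv_path, db_path, table, batch_size=50_000):
        con = sqlite3.connect(db_path)
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            cols = ", ".join('"%s"' % c for c in header)
            marks = ", ".join("?" * len(header))
            con.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
            while True:
                batch = list(islice(reader, batch_size))
                if not batch:
                    break
                con.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), batch)
                con.commit()
        con.close()

    # e.g. load_in_batches("occupancy-archive.csv", "meters.db", "occupancy")
    ```
    
    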

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power web applications.

    See the Splitgraph documentation for more information.

  5. O*NET Database

    • onetcenter.org
    excel, mysql, oracle +2
    Updated May 22, 2025
    Cite
    National Center for O*NET Development (2025). O*NET Database [Dataset]. https://www.onetcenter.org/database.html
    Explore at:
    Available download formats: oracle, sql server, text, mysql, excel
    Dataset updated
    May 22, 2025
    Dataset provided by
    Occupational Information Network
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Dataset funded by
    United States Department of Labor: http://www.dol.gov/
    Description

    The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.

    Data content areas include:

    • Worker Characteristics (e.g., Abilities, Interests, Work Styles)
    • Worker Requirements (e.g., Education, Knowledge, Skills)
    • Experience Requirements (e.g., On-the-Job Training, Work Experience)
    • Occupational Requirements (e.g., Detailed Work Activities, Work Context)
    • Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)

  6. Australian Public Holidays Dates Machine Readable Dataset

    • researchdata.edu.au
    Updated Feb 24, 2014
    Cite
    Department of the Prime Minister and Cabinet (2014). Australian Public Holidays Dates Machine Readable Dataset [Dataset]. https://researchdata.edu.au/australian-public-holidays-readable-dataset/2995651
    Explore at:
    Dataset updated
    Feb 24, 2014
    Dataset provided by
    data.gov.au
    Authors
    Department of the Prime Minister and Cabinet
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    The Department of the Prime Minister and Cabinet is no longer maintaining this dataset. If you would like to take ownership of this dataset for ongoing maintenance, please contact us.

    PLEASE READ BEFORE USING

    The data format has been updated to align with a tidy data style (http://vita.had.co.nz/papers/tidy-data.html).

    The data in this dataset is manually collected and combined in a csv format from the following state and territory portals:

    - https://www.cmtedd.act.gov.au/communication/holidays
    - https://www.nsw.gov.au/about-nsw/public-holidays
    - https://nt.gov.au/nt-public-holidays
    - https://www.qld.gov.au/recreation/travel/holidays/public
    - https://www.safework.sa.gov.au/resources/public-holidays
    - https://worksafe.tas.gov.au/topics/laws-and-compliance/public-holidays
    - https://business.vic.gov.au/business-information/public-holidays
    - https://www.commerce.wa.gov.au/labour-relations/public-holidays-western-australia

    The data API by default returns only the first 100 records. The JSON response will contain a key that shows the link for the next page of records. Alternatively, you can view all records by updating the limit on the endpoint or using a query to select all records, i.e. /api/3/action/datastore_search_sql?sql=SELECT * from "{{resource_id}}".
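    A sketch of walking those pages from Python follows. The response layout assumed here (result.records plus a result._links.next key) follows the CKAN datastore_search convention that the description above refers to; verify it against a live response, and the BASE host is our assumption.

    ```python
    import json
    from urllib.request import urlopen
    from urllib.parse import urljoin

    BASE = "https://data.gov.au"  # assumed portal host

    def fetch_all_records(first_url, fetch=None):
        """Follow the 'next' links until a page comes back empty."""
        if fetch is None:
            fetch = lambda url: json.load(urlopen(url))
        records, url = [], first_url
        while url:
            result = fetch(url)["result"]
            page = result.get("records", [])
            if not page:
                break
            records.extend(page)
            nxt = result.get("_links", {}).get("next")
            url = urljoin(BASE, nxt) if nxt else None
        return records
    ```

    The injectable `fetch` argument keeps the pagination logic testable without network access.
    
    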

