Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The title of our paper submitted to CC20 is
Improving Database Query Performance with Automatic Fusion
This repository demonstrates the reproducibility of the experiments in that paper. We provide the scripts and original data used in the experiments, which involve two main systems: HorsePower and the RDBMS MonetDB. We supply step-by-step instructions for configuring and deploying both systems.
On this page, you will see:
how to run the experiments (Section 2); and
the results used in the paper (Section 3).
All experiments were run on a server called sable-intel equipped with:
Ubuntu 16.04.6 LTS (64-bit)
4 Intel Xeon E7-4850 CPUs @ 2.00 GHz
40 cores in total (80 threads)
128 GB RAM
Docker setup
Download the Docker image: cc20-docker.tar (about 13 GB)
docker load < cc20-docker.tar
Generate a named container (then exit)
docker run --hostname sableintel -it --name=container-cc20 wukefe/cc20-docker exit
Then, you can run the container
docker start -ai container-cc20
Open a new terminal to access the container (optional)
docker exec -it container-cc20 /bin/bash
Introduction to MonetDB
Work directory for MonetDB
/home/hanfeng/cc20/monetdb
Start MonetDB (use all available threads)
./run.sh start
Log in to MonetDB using its client tool, mclient
mclient -d tpch1
sql> SELECT 'Hello world';
+-------------+
| L2          |
+=============+
| Hello world |
+-------------+
1 tuple
Show the list of tables in the current database
sql> \d
TABLE  sys.customer
TABLE  sys.lineitem
TABLE  sys.nation
TABLE  sys.orders
TABLE  sys.part
TABLE  sys.partsupp
TABLE  sys.region
TABLE  sys.supplier
Leave the session
sql> \q
Stop MonetDB before continuing with the experiments
./run.sh stop
Reference: how to install MonetDB, with an introduction to its server and client programs.
Run MonetDB with TPC-H queries
MonetDB: server mode
Invoke MonetDB with a specific number of threads (e.g. 1)
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=1
Open a new terminal
docker exec -it container-cc20 /bin/bash
cd cc20/monetdb
Note: Type \q to exit the server mode.
Run with a specific number of threads (two terminals required)
1 thread
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=1
(time ./runtest | mclient -d tpch1) &> "log/log_thread_1.log"
2 threads
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=2
(time ./runtest | mclient -d tpch1) &> "log/log_thread_2.log"
4 threads
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=4
(time ./runtest | mclient -d tpch1) &> "log/log_thread_4.log"
8 threads
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=8
(time ./runtest | mclient -d tpch1) &> "log/log_thread_8.log"
16 threads
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=16
(time ./runtest | mclient -d tpch1) &> "log/log_thread_16.log"
32 threads
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=32
(time ./runtest | mclient -d tpch1) &> "log/log_thread_32.log"
64 threads
mserver5 --set embedded_py=true --dbpath=/home/hanfeng/datafarm/2019/tpch1 --set monet_vault_key=/home/hanfeng/datafarm/2019/tpch1/.vaultkey --set gdk_nr_threads=64
(time ./runtest | mclient -d tpch1) &> "log/log_thread_64.log"
Post data processing - MonetDB
Fetch average execution time (ms)
grep -A 3 avg_query log/log_thread_1.log | python cut.py
699.834133333   // q1
85.9178666667   // q4
65.0172         // q6
101.730666667   // q12
58.212          // q14
60.1138666667   // q16
248.926466667   // q19
77.6482         // q22
grep -A 3 avg_query log/log_thread_2.log | python cut.py
grep -A 3 avg_query log/log_thread_4.log | python cut.py
grep -A 3 avg_query log/log_thread_8.log | python cut.py
grep -A 3 avg_query log/log_thread_16.log | python cut.py
grep -A 3 avg_query log/log_thread_32.log | python cut.py
grep -A 3 avg_query log/log_thread_64.log | python cut.py
Note: The above numbers can be copied to an Excel file for further analysis before plotting figures. Details can be found in Section 3.
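The per-thread-count grep commands above can be collapsed into a loop. This is a convenience sketch: it assumes the log/log_thread_<n>.log naming used above and simply skips any thread count that was not run.

```shell
# Summarize every per-thread log in one pass.
for t in 1 2 4 8 16 32 64; do
  log="log/log_thread_${t}.log"
  [ -f "$log" ] || continue          # skip thread counts without a log file
  echo "== ${t} thread(s) =="
  grep -A 3 avg_query "$log" | python cut.py
done
```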
Run with HorseIR
The HorsePower project can be found on GitHub. In the docker image, it has been placed in /home/hanfeng/cc20/horse.
https://github.com/Sable/HorsePower
Execution time
We then run each query 15 times to get the average execution time (ms).
(cd /home/hanfeng/cc20/horse/ && time ./run_all.sh)
The script run_all.sh runs three versions of generated C code, each based on a different level of optimization.
For each version, it first compiles the C code and then runs the generated binary with different numbers of threads (i.e. 1/2/4/8/16/32/64). Each run computes a query 15 times and returns the average.
All output is saved into log files; for example, log/naive/log_q6.log contains the results for query 6 in the naive version across all thread counts.
Log file structures
log/naive/*.txt
log/opt1/*.txt
log/opt2/*.txt
Fetch a brief summary of execution time from a log file
grep 'Run with 15 times' log/naive/log_q6.txt
q06>> Run with 15 times, last 15 average (ms): 266.638 | 278.999 266.134 266.417 <12 more>   # 1 thread
q06>> Run with 15 times, last 15 average (ms): 138.556 | 144.474 137.837 137.579 <12 more>   # 2 threads
q06>> Run with 15 times, last 15 average (ms): 71.8851 | 75.339 72.102 72.341 <12 more>      # 4 threads
q06>> Run with 15 times, last 15 average (ms): 73.111 | 75.867 72.53 72.936 <12 more>        # 8 threads
q06>> Run with 15 times, last 15 average (ms): 56.1003 | 59.263 56.057 56.039 <12 more>      # 16 threads
q06>> Run with 15 times, last 15 average (ms): 56.8858 | 59.466 56.651 57.109 <12 more>      # 32 threads
q06>> Run with 15 times, last 15 average (ms): 53.4254 | 55.884 54.457 52.878 <12 more>      # 64 threads
Extracting this information for all queries across the three versions quickly becomes tedious, so we provide a simple helper.
./run.sh fetch log | python gen_for_copy.py
Output data in the following format
// query id
| ... | ... | ... |   # 1 thread
| ... | ... | ... |   # 2 threads
...
| ... | ... | ... |   # 64 threads
Note that we copy the generated numbers into the Excel file described in Section 3. Within that file, we compare the performance of MonetDB against the different versions of generated C code.
Compilation time
Work directory
/home/hanfeng/cc20/horse/codegen
Fetch compilation time for different kinds of C code
./run.sh compile naive &> log_cc20_compile_naive.txt
./run.sh compile opt1 &> log_cc20_compile_opt1.txt
./run.sh compile opt2 &> log_cc20_compile_opt2.txt
Let's look into the result of query 1 in the log file log_cc20_compile_naive.txt.
Time variable                       usr          sys          wall           GGC
 phase setup              : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)  1266 kB ( 18%)
 phase parsing            : 0.07 (54%)  0.07 (88%)  0.14 (64%)  3897 kB ( 55%)
 phase opt and generate   : 0.06 (46%)  0.01 (12%)  0.07 (32%)  1899 kB ( 27%)
 dump files               : 0.00 ( 0%)  0.00 ( 0%)  0.02 ( 9%)     0 kB (  0%)
 df reg dead/unused notes : 0.01 ( 8%)  0.00 ( 0%)  0.00 ( 0%)    31 kB (  0%)
 register information     : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)     0 kB (  0%)
 preprocessing            : 0.03 (23%)  0.02 (25%)  0.08 (36%)  1468 kB ( 21%)
 lexical analysis         : 0.00 ( 0%)  0.03 (38%)  0.05 (23%)     0 kB (  0%)
 parser (global)          : 0.04 (31%)  0.02 (25%)  0.01 ( 5%)  2039 kB ( 29%)
 tree SSA other           : 0.00 ( 0%)  0.01 (12%)  0.00 ( 0%)     3 kB (  0%)
 integrated RA            : 0.01 ( 8%)  0.00 ( 0%)  0.01 ( 5%)   726 kB ( 10%)
 thread pro- & epilogue   : 0.02 (15%)  0.00 ( 0%)  0.00 ( 0%)    41 kB (  1%)
 shorten branches         : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)     0 kB (  0%)
 final                    : 0.00 ( 0%)  0.00 ( 0%)  0.01 ( 5%)    56 kB (  1%)
 initialize rtl           : 0.01 ( 8%)  0.00 ( 0%)  0.01 ( 5%)    12 kB (  0%)
 rest of compilation      : 0.01 ( 8%)  0.00 ( 0%)  0.00 ( 0%)    62 kB (  1%)
 TOTAL                    : 0.13        0.08        0.22        7072 kB
The compilation time is split into many phases. We take the total wall time as the actual time spent on compilation; for this query, the whole compilation takes 0.22 seconds. (Note that manual work is required to retrieve the compilation time from each log.)
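The manual step can also be scripted. The snippet below is a sketch that assumes the time-report layout shown above, where the wall time is the fifth whitespace-separated field of the TOTAL line; it demonstrates the pipe on that sample line, and the same pipe can then be pointed at the real log files (e.g. grep ' TOTAL' log_cc20_compile_naive.txt | awk '{print $5}').

```shell
# Extract the wall-time column from a time-report "TOTAL" line.
# Fields: TOTAL(1) :(2) usr(3) sys(4) wall(5) ...
printf '%s\n' ' TOTAL : 0.13 0.08 0.22 7072 kB' | awk '/TOTAL/ {print $5}'
# prints 0.22
```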
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The file set is a freely downloadable aggregation of information about Australian schools. The individual files represent a series of tables which, when considered together, form a relational database. The records cover the years 2008-2014 and include information on approximately 9500 primary and secondary school main-campuses and around 500 subcampuses. The records all relate to school-level data; no data about individuals is included. All the information has previously been published and is publicly available but it has not previously been released as a documented, useful aggregation. The information includes:
(a) the names of schools
(b) staffing levels, including full-time and part-time teaching and non-teaching staff
(c) student enrolments, including the number of boys and girls
(d) school financial information, including Commonwealth government, state government, and private funding
(e) test data, potentially for school years 3, 5, 7 and 9, relating to an Australian national testing programme known by the trademark 'NAPLAN'
Documentation of this Edition 2016.1 is incomplete but the organization of the data should be readily understandable to most people. If you are a researcher, the simplest way to study the data is to make use of the SQLite3 database called 'school-data-2016-1.db'. If you are unsure how to use an SQLite database, ask a guru.
The database was constructed directly from the other included files by running the following command at a command-line prompt:
sqlite3 school-data-2016-1.db < school-data-2016-1.sql
Note that a few non-consequential errors will be reported if you run this command yourself. The errors arise because the SQLite database is created by importing a series of '.csv' files, each of which contains a header line with the names of the variables relevant to each column. That information is useful for many statistical packages, but it is not what SQLite expects, so it complains about the header. Despite the complaint, the database will be created correctly.
Briefly, the data are organized as follows.
(1) The .csv files ('comma separated values') do not actually use a comma as the field delimiter. Instead, the vertical bar character '|' (ASCII octal 174, decimal 124, hex 7C) is used. If you read the .csv files using Microsoft Excel, Open Office, or Libre Office, you will need to set the field separator to '|'. Check your software documentation to understand how to do this.
(2) Each school-related record is indexed by an identifier called 'ageid'. The ageid uniquely identifies each school and consequently serves as the appropriate variable for JOINing records in different data files. For example, the first school-related record after the header line in file 'students-headed-bar.csv' shows the ageid of the school as 40000. The relevant school name can be found by looking in the file 'ageidtoname-headed-bar.csv' to discover that the ageid of 40000 corresponds to a school called 'Corpus Christi Catholic School'.
(3) In addition to the variable 'ageid', each record is also identified by one or two 'year' variables. The most important purpose of a year identifier is to indicate the year to which the record relates. For example, if one turns again to file 'students-headed-bar.csv', one sees that the first seven school-related records after the header line all relate to the school Corpus Christi Catholic School with ageid of 40000. The variable that identifies the important differences between these seven records is 'studentsyear', which shows the year to which the student data refer. One can see, for example, that in 2008 there were a total of 410 students enrolled, of whom 185 were girls and 225 were boys (look at the variable names in the header line).
(4) The variables relating to years are given different names in the different files ('studentsyear' in the file 'students-headed-bar.csv', 'financesummaryyear' in the file 'financesummary-headed-bar.csv'). Despite the different names, the year variables provide the second-level means for joining information across files. For example, if you wanted to relate the enrolments at a school in each year to its financial state, you might JOIN records using 'ageid' in the two files and, secondarily, match 'studentsyear' with 'financesummaryyear'.
(5) The manipulation of the data is most readily done using the SQL language with the SQLite database but it can also be done in a variety of statistical packages.
(6) It is our intention for Edition 2016-2 to create large 'flat' files suitable for use by non-researchers who want to view the data with spreadsheet software. The disadvantage of such 'flat' files is that they contain vast amounts of redundant information and might not display the data in the form that the user most wants it.
(7) Geocoding of the schools is not available in this edition.
(8) Some files, such as 'sector-headed-bar.csv' are not used in the creation of the database but are provided as a convenience for researchers who might wish to recode some of the data to remove redundancy.
(9) A detailed example of a suitable SQLite query can be found in the file 'school-data-sqlite-example.sql'. The same query, used in the context of analyses done with the excellent, freely available R statistical package (http://www.r-project.org) can be seen in the file 'school-data-with-sqlite.R'.
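The two-level JOIN pattern described in points (2)-(4) can be sketched with sqlite3 against a tiny in-memory fixture. The table and column names below are illustrative guesses based on the .csv file names, not the actual schema; the figures for Corpus Christi Catholic School are taken from the description above.

```shell
sqlite3 :memory: <<'SQL'
-- Hypothetical miniatures of ageidtoname-headed-bar.csv and
-- students-headed-bar.csv; real column names may differ.
CREATE TABLE ageidtoname (ageid INTEGER, schoolname TEXT);
CREATE TABLE students    (ageid INTEGER, studentsyear INTEGER,
                          girls INTEGER, boys INTEGER);
INSERT INTO ageidtoname VALUES (40000, 'Corpus Christi Catholic School');
INSERT INTO students    VALUES (40000, 2008, 185, 225);

-- First-level join on ageid, second-level restriction on the year variable.
SELECT n.schoolname, s.studentsyear, s.girls + s.boys AS total_students
FROM students AS s
JOIN ageidtoname AS n USING (ageid)
WHERE s.studentsyear = 2008;
SQL
```

In SQLite's default list output mode this prints `Corpus Christi Catholic School|2008|410`, matching the 2008 enrolment described above.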
This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst, the data were ill suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data, and the data is only made available to accredited researchers through DataFirst's Secure Data Service.
The dataset was separated into the following data files:
Applications, individuals
Administrative records [adm]
Other [oth]
The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services. The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Monthly archive of all parking meter sensor activity over the previous 36 months (3 years). Updated monthly with data from 2 months prior (e.g., January data will be published in early March).
For best-available current "live" status, see "LADOT Parking Meter Occupancy".
For location and parking policy details, see "LADOT Metered Parking Inventory & Policies".
For best results, import into a database or use advanced data access methods appropriate for processing large files.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power web applications.
See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.
Data content areas include:
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The Department of the Prime Minister and Cabinet is no longer maintaining this dataset. If you would like to take ownership of this dataset for ongoing maintenance, please contact us.

---

PLEASE READ BEFORE USING

The data format has been updated to align with a tidy data style (http://vita.had.co.nz/papers/tidy-data.html).

The data in this dataset is manually collected and combined in a csv format from the following state and territory portals:

- https://www.cmtedd.act.gov.au/communication/holidays
- https://www.nsw.gov.au/about-nsw/public-holidays
- https://nt.gov.au/nt-public-holidays
- https://www.qld.gov.au/recreation/travel/holidays/public
- https://www.safework.sa.gov.au/resources/public-holidays
- https://worksafe.tas.gov.au/topics/laws-and-compliance/public-holidays
- https://business.vic.gov.au/business-information/public-holidays
- https://www.commerce.wa.gov.au/labour-relations/public-holidays-western-australia

The data API by default returns only the first 100 records. The JSON response will contain a key that shows the link for the next page of records.
Alternatively, you can view all records by updating the limit on the endpoint or using a query to select all records, i.e. /api/3/action/datastore_search_sql?sql=SELECT * from "{{resource_id}}".