Optimizing queries

Last updated 3 months ago

Even though in each release performance improvements are included to make gitbase faster, there are some queries that might take too long. By rewriting them in some ways, you can squeeze that extra performance you need by taking advantage of some optimisations that are already in place.

There are two ways to optimize a gitbase query:

  • Create an index for some parts.

  • Making sure the joined tables are squashed.

Assessing performance bottlenecks

To assess if there is a performance bottleneck you might want to inspect the execution tree of the query. This is also very helpful when reporting performance issues on gitbase.

The output from an EXPLAIN query is represented as a tree and shows how the query is actually evaluated. You can do that using the following query:

EXPLAIN FORMAT=TREE <SQL QUERY TO EXPLAIN>

For example, the given query:

EXPLAIN FORMAT=TREE
SELECT * FROM refs
NATURAL JOIN ref_commits
WHERE ref_commits.history_index = 0

Will output something like this:

+-----------------------------------------------------------------------------------------+
| plan |
+-----------------------------------------------------------------------------------------+
| Project(refs.repository_id, refs.ref_name, refs.commit_hash, ref_commits.history_index) |
| └─ SquashedTable(refs, ref_commits) |
| ├─ Columns |
| │ ├─ Column(repository_id, TEXT, nullable=false) |
| │ ├─ Column(ref_name, TEXT, nullable=false) |
| │ ├─ Column(commit_hash, TEXT, nullable=false) |
| │ ├─ Column(repository_id, TEXT, nullable=false) |
| │ ├─ Column(commit_hash, TEXT, nullable=false) |
| │ ├─ Column(ref_name, TEXT, nullable=false) |
| │ └─ Column(history_index, INT64, nullable=false) |
| └─ Filters |
| ├─ refs.repository_id = ref_commits.repository_id |
| ├─ refs.ref_name = ref_commits.ref_name |
| ├─ refs.commit_hash = ref_commits.commit_hash |
| └─ ref_commits.history_index = 0 |
+-----------------------------------------------------------------------------------------+
15 rows in set (0.00 sec)

Detecting performance issues in the query tree

Some performance issues might not be obvious, but there are a few that really stand out by just looking at the query tree.

  • Joins not squashed. If you performed some joins between tables and instead of a SquashedTable node you see Join and Table nodes, it means the joins were not successfully squashed. There is a more detailed explanation about this in next sections of this document.

  • Indexes not used. If you can't see the indexes in your table nodes, it means somehow those indexes are not being used by the table. There is a more detailed explanation about this in next sections of this document.

Indexes

The more obvious way to improve the performance of a query is to create an index for such query. Since you can index multiple columns or a single arbitrary expression, this may be useful for some kinds of queries. For example, if you're querying by language, you may want to index that so there is no need to compute the language each time.

CREATE INDEX files_language_idx ON files USING pilosa (language(file_path, blob_content))

Once you have the index in place, gitbase only looks for the rows with the values matching your conditions.

But beware, even if you have an index it's possible that gitbase will not use it. These are the forms an expression must have to make sure the index will be used.

  • <indexed expression> = <evaluable expression>

  • <indexed expression> < <evaluable expression>

  • <indexed expression> > <evaluable expression>

  • <indexed expression> <= <evaluable expression>

  • <indexed expression> >= <evaluable expression>

  • <indexed expression> != <evaluable expression>

  • <indexed expression> IN <evaluable expression>

  • <indexed expression> BETWEEN <evaluable expression> AND <evaluable expression>

<indexed expression> is the expression that was indexed when the index was created, in the previous case that would be language(file_path, blob_content). <evaluable expression> is any expression that can be evaluated without using the current row. For example, a literal ("foo"), a function that takes no column arguments (SUBSTRING("foo", 1)), etc.

So, if you have this query, the index would be used.

SELECT file_path FROM files WHERE language(file_path, blob_content) = 'Go'

But these queries would not use the index.

SELECT file_path FROM files WHERE language(file_path, blob_content) = SUBSTRING(file_path, 0, 2)
SELECT file_path FROM files WHERE language(file_path, blob_content) LIKE 'G_'

Note that when you use an index on multiple columns, there is a limitation (that may change in the future) that requires all columns sharing the same operation.

For example, let's make an index on two columns.

CREATE INDEX commits_multi_idx ON commits USING pilosa (committer_name, committer_email)

This query would use the index.

SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email = 'foo@example.com'

These, however, would not use the index.

SELECT * FROM commits WHERE committer_name = 'John Doe'

All columns in an index need to be present in the filters.

SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email != 'foo@example.com'

All the columns need to use the same operation. In this case, one is using = and the other !=. This is a current limitation that will be removed in the future.

Squash tables

There is an optimization done inside gitbase called squashed tables. Instead of reading all the data from the tables and then performing the join, a squashed table is the union of several tables in which the output of a table is generated using the output of the previous one.

Imagine we want to join commits, commit_files and files. Without the squashed joins we would read all commits, all commit_files and all files. Then, we would join all these rows. This is an incredibly expensive operation for large repositories. With squashed tables, however, we read all commits, then, for each commit we generate the commit_files for that commit and then for each commit file we generate the files for them. This has two advantages:

  • Filters are applied early on, which reduces the amount of data that needs to be read. If you filtered commits by a particular author in our previous example, only commit files, and thus files, by that commit author would be read, instead of all of them.

  • It works with raw git objects, not database rows, which makes it way more performant since there is no need to serialize and deserialize.

As a result, your query could be orders of magnitude faster.

Limitations

Only works per repository. This optimisation is built on top of some premises, one of them is the fact that all tables are joined by repository_id.

This query will get squashed, because NATURAL JOIN makes sure all columns with equal names are used in the join.

SELECT * FROM refs NATURAL JOIN ref_commits NATURAL JOIN commits

This query, however, will not be squashed.

SELECT * FROM refs r
INNER JOIN ref_commits rc ON r.ref_name = rc.ref_name
INNER JOIN commits c ON rc.commit_hash = c.commit_hash

It requires some filters to be present in order to perform the squash.

This query will be squashed.

SELECT * FROM commit_files NATURAL JOIN files

This query will not be squashed, as the join between commit_files and files requires more filters to be squashed.

SELECT * FROM commit_files cf
INNER JOIN files f ON cf.file_path = f.file_path

TIP: we suggest always using NATURAL JOIN for joining tables, since it's less verbose and already satisfies all the filters for squashing tables. The only exception to this advice is when joining refs and ref_commits. A NATURAL JOIN between refs and ref_commits will only get the HEAD commit of the reference. The same happens with commits and commit_trees/commit_files.

You can find the full list of conditions that need to be met for the squash to be applied here.

Only works if the tables joined follow a hierarchy. Joinin commits and files does not work, or joining blobs with files. It needs to follow one of the hierarchies of tables.

repositories -> refs -> ref_commits -> commits -> commit_trees -> tree_entries -> blobs
repositories -> refs -> ref_commits -> commits -> commit_blobs -> blobs
repositories -> refs -> ref_commits -> commits -> commit_files -> files
repositories -> remotes -> refs -> (any of the other hierarchies)

As long as the tables you join are a subset of any of these hierarchies, it will be applied, provided you gave the proper filters. If only some part follows the hierarchy, the leftmost squash will be performed.

For example, if we join repositories, remotes, and then commit_blobs and blobs, the result will be a squashed table of repositories and remotes and a regular join with commit_blobs and blobs. The rule will try to squash as many tables as possible.

How to check if the squash was applied

You can check if the squash optimisation was applied to your query by using the DESCRIBE command.

DESCRIBE FORMAT=TREE <your query>

This will pretty-print the analyzed tree of your query. If you see a node named SquashedTable it means your query was squashed, otherwise some part of your query is not squashable or a filter might be missing.

List of filters for squashed tables

T1.repository_id = T2.repository_id: all tables must be joined by repository_id.

refs with ref_commits

  • refs.ref_name = ref_commits.ref_name

  • refs.commit_hash = ref_commits.commit_hash (only if you want to get just the HEAD commit)

refs with commits

  • refs.commit_hash = commits.commit_hash

refs with commit_trees

  • refs.commit_hash = commit_trees.commit_hash

refs with commit_blobs

  • refs.commit_hash = commit_blobs.commit_hash

refs with commit_files

  • refs.commit_hash = commit_files.commit_hash

ref_commits with commits

  • ref_commits.commit_hash = commits.commit_hash

ref_commits with commit_trees

  • ref_commits.commit_hash = commit_trees.commit_hash

ref_commits with commit_blobs

  • ref_commits.commit_hash = commit_blobs.commit_hash

ref_commits with commit_files

  • ref_commits.commit_hash = commit_files.commit_hash

  • commits.tree_hash = commit_files.tree_hash (only if you want just the main commit tree files)

commits with commit_trees

  • commits.commit_hash = commit_trees.commit_hash

  • commits.tree_hash = commit_trees.tree_hash (only if you want just the main commit tree)

commits with commit_blobs

  • commits.commit_hash = commit_blobs.commit_hash

commits with commit_files

  • commits.commit_hash = commit_files.commit_hash

commits with tree_entries

  • commits.tree_hash = tree_entries.tree_hash

commit_trees with tree_entries

  • commit_trees.tree_hash = tree_entries.tree_hash

commit_blobs with blobs

  • commit_blobs.blob_hash = blobs.blob_hash

tree_entries with blobs

  • tree_entries.blob_hash = blobs.blob_hash

commit_files with files

  • commit_files.file_path = files.file_path

  • commit_files.tree_hash = files.tree_hash

  • commit_files.blob_hash = files.blob_hash