Even though in each release performance improvements are included to make gitbase faster, there are some queries that might take too long. By rewriting them in some ways, you can squeeze that extra performance you need by taking advantage of some optimisations that are already in place.
There are two ways to optimize a gitbase query:
Create an index for some parts.
Making sure the joined tables are squashed.
To assess if there is a performance bottleneck you might want to inspect the execution tree of the query. This is also very helpful when reporting performance issues on gitbase.
The output from an
EXPLAIN query is represented as a tree and shows how the query is actually evaluated. You can do that using the following query:
EXPLAIN FORMAT=TREE <SQL QUERY TO EXPLAIN>
For example, the given query:
EXPLAIN FORMAT=TREESELECT * FROM refsNATURAL JOIN ref_commitsWHERE ref_commits.history_index = 0
Will output something like this:
+-----------------------------------------------------------------------------------------+| plan |+-----------------------------------------------------------------------------------------+| Project(refs.repository_id, refs.ref_name, refs.commit_hash, ref_commits.history_index) || └─ SquashedTable(refs, ref_commits) || ├─ Columns || │ ├─ Column(repository_id, TEXT, nullable=false) || │ ├─ Column(ref_name, TEXT, nullable=false) || │ ├─ Column(commit_hash, TEXT, nullable=false) || │ ├─ Column(repository_id, TEXT, nullable=false) || │ ├─ Column(commit_hash, TEXT, nullable=false) || │ ├─ Column(ref_name, TEXT, nullable=false) || │ └─ Column(history_index, INT64, nullable=false) || └─ Filters || ├─ refs.repository_id = ref_commits.repository_id || ├─ refs.ref_name = ref_commits.ref_name || ├─ refs.commit_hash = ref_commits.commit_hash || └─ ref_commits.history_index = 0 |+-----------------------------------------------------------------------------------------+15 rows in set (0.00 sec)
Some performance issues might not be obvious, but there are a few that really stand out by just looking at the query tree.
Joins not squashed. If you performed some joins between tables and instead of a
SquashedTable node you see
Table nodes, it means the joins were not successfully squashed. There is a more detailed explanation about this in next sections of this document.
Indexes not used. If you can't see the indexes in your table nodes, it means somehow those indexes are not being used by the table. There is a more detailed explanation about this in next sections of this document.
The more obvious way to improve the performance of a query is to create an index for such query. Since you can index multiple columns or a single arbitrary expression, this may be useful for some kinds of queries. For example, if you're querying by language, you may want to index that so there is no need to compute the language each time.
CREATE INDEX files_language_idx ON files USING pilosa (language(file_path, blob_content))
Once you have the index in place, gitbase only looks for the rows with the values matching your conditions.
But beware, even if you have an index it's possible that gitbase will not use it. These are the forms an expression must have to make sure the index will be used.
<indexed expression> = <evaluable expression>
<indexed expression> < <evaluable expression>
<indexed expression> > <evaluable expression>
<indexed expression> <= <evaluable expression>
<indexed expression> >= <evaluable expression>
<indexed expression> != <evaluable expression>
<indexed expression> IN <evaluable expression>
<indexed expression> BETWEEN <evaluable expression> AND <evaluable expression>
<indexed expression> is the expression that was indexed when the index was created, in the previous case that would be
<evaluable expression> is any expression that can be evaluated without using the current row. For example, a literal (
"foo"), a function that takes no column arguments (
SUBSTRING("foo", 1)), etc.
So, if you have this query, the index would be used.
SELECT file_path FROM files WHERE language(file_path, blob_content) = 'Go'
But these queries would not use the index.
SELECT file_path FROM files WHERE language(file_path, blob_content) = SUBSTRING(file_path, 0, 2)
SELECT file_path FROM files WHERE language(file_path, blob_content) LIKE 'G_'
Note that when you use an index on multiple columns, there is a limitation (that may change in the future) that requires all columns sharing the same operation.
For example, let's make an index on two columns.
CREATE INDEX commits_multi_idx ON commits USING pilosa (committer_name, committer_email)
This query would use the index.
SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email = 'firstname.lastname@example.org'
These, however, would not use the index.
SELECT * FROM commits WHERE committer_name = 'John Doe'
All columns in an index need to be present in the filters.
SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email != 'email@example.com'
All the columns need to use the same operation. In this case, one is using
= and the other
!=. This is a current limitation that will be removed in the future.
There is an optimization done inside gitbase called squashed tables. Instead of reading all the data from the tables and then performing the join, a squashed table is the union of several tables in which the output of a table is generated using the output of the previous one.
Imagine we want to join
files. Without the squashed joins we would read all
commit_files and all
files. Then, we would join all these rows. This is an incredibly expensive operation for large repositories. With squashed tables, however, we read all
commits, then, for each commit we generate the
commit_files for that commit and then for each commit file we generate the
files for them. This has two advantages:
Filters are applied early on, which reduces the amount of data that needs to be read. If you filtered commits by a particular author in our previous example, only commit files, and thus files, by that commit author would be read, instead of all of them.
It works with raw git objects, not database rows, which makes it way more performant since there is no need to serialize and deserialize.
As a result, your query could be orders of magnitude faster.
Only works per repository. This optimisation is built on top of some premises, one of them is the fact that all tables are joined by
This query will get squashed, because
NATURAL JOIN makes sure all columns with equal names are used in the join.
SELECT * FROM refs NATURAL JOIN ref_commits NATURAL JOIN commits
This query, however, will not be squashed.
SELECT * FROM refs rINNER JOIN ref_commits rc ON r.ref_name = rc.ref_nameINNER JOIN commits c ON rc.commit_hash = c.commit_hash
It requires some filters to be present in order to perform the squash.
This query will be squashed.
SELECT * FROM commit_files NATURAL JOIN files
This query will not be squashed, as the join between
files requires more filters to be squashed.
SELECT * FROM commit_files cfINNER JOIN files f ON cf.file_path = f.file_path
TIP: we suggest always using
NATURAL JOIN for joining tables, since it's less verbose and already satisfies all the filters for squashing tables. The only exception to this advice is when joining
NATURAL JOIN between
ref_commits will only get the HEAD commit of the reference. The same happens with
You can find the full list of conditions that need to be met for the squash to be applied here.
Only works if the tables joined follow a hierarchy. Joinin
files does not work, or joining
files. It needs to follow one of the hierarchies of tables.
repositories -> refs -> ref_commits -> commits -> commit_trees -> tree_entries -> blobsrepositories -> refs -> ref_commits -> commits -> commit_blobs -> blobsrepositories -> refs -> ref_commits -> commits -> commit_files -> filesrepositories -> remotes -> refs -> (any of the other hierarchies)
As long as the tables you join are a subset of any of these hierarchies, it will be applied, provided you gave the proper filters. If only some part follows the hierarchy, the leftmost squash will be performed.
For example, if we join
remotes, and then
blobs, the result will be a squashed table of
remotes and a regular join with
blobs. The rule will try to squash as many tables as possible.
You can check if the squash optimisation was applied to your query by using the
DESCRIBE FORMAT=TREE <your query>
This will pretty-print the analyzed tree of your query. If you see a node named
SquashedTable it means your query was squashed, otherwise some part of your query is not squashable or a filter might be missing.
T1.repository_id = T2.repository_id: all tables must be joined by
refs.ref_name = ref_commits.ref_name
refs.commit_hash = ref_commits.commit_hash (only if you want to get just the HEAD commit)
refs.commit_hash = commits.commit_hash
refs.commit_hash = commit_trees.commit_hash
refs.commit_hash = commit_blobs.commit_hash
refs.commit_hash = commit_files.commit_hash
ref_commits.commit_hash = commits.commit_hash
ref_commits.commit_hash = commit_trees.commit_hash
ref_commits.commit_hash = commit_blobs.commit_hash
ref_commits.commit_hash = commit_files.commit_hash
commits.tree_hash = commit_files.tree_hash (only if you want just the main commit tree files)
commits.commit_hash = commit_trees.commit_hash
commits.tree_hash = commit_trees.tree_hash (only if you want just the main commit tree)
commits.commit_hash = commit_blobs.commit_hash
commits.commit_hash = commit_files.commit_hash
commits.tree_hash = tree_entries.tree_hash
commit_trees.tree_hash = tree_entries.tree_hash
commit_blobs.blob_hash = blobs.blob_hash
tree_entries.blob_hash = blobs.blob_hash
commit_files.file_path = files.file_path
commit_files.tree_hash = files.tree_hash
commit_files.blob_hash = files.blob_hash