Thursday, February 25, 2010

ICE ini settings

The brighthouse.ini file contains all Infobright-specific settings. But which settings are actually available, and what do they do? When I looked at the source code, I saw settings I had no idea existed.
The list below applies to ICE 3.3.1.

BufferingLevel
Default value: 2
Used when the loader heap can't hold all the data; defines the size of the buffer. The buffer size can be calculated as LoaderMainHeapSize / BufferingLevel. For example, with the defaults (LoaderMainHeapSize = 320 MB, BufferingLevel = 2), the buffer is 160 MB.

CacheFolder
Default value: "cache"
As the name implies, this defines the location of the Infobright cache folder.

CachingLevel
Default value: 1
Used to define the type of caching used on the uncompressed heap. If the value is 1 or 2, Infobright removes unused data packs from the heap when it runs out of space; this is not done for any other value. If that isn't enough, or a value other than 1 or 2 is defined, it tries to release memory by compressing the data and moving it to the compressed heap.

LicenseFile
Default value: ""
This points to the Infobright license file and is only used in the IEE evaluation edition.

ClusterSize
Default value: 2000
The maximum size of a data file, in MB.

ControlMessages
Default value: 0
Enables control messages, which are usually needed when investigating performance. A value of 2 enables them; values above 2 enable them and also produce timestamps.

HugeFileDir
Default value: ""
Only used on non-Windows machines. Specifies the location of a memory-mapped file containing the server main heap. If the setting isn't defined, no memory-mapped file is used; the heap is placed directly in memory instead.

InternalMessages
Default value: false
Used to print FET - Function Execution Times - to a log file called development.log. Only used if compiled with --with-FET.

KNFolder
Default value: BH_RSI_Repository
The folder containing the Knowledge Grid data.

KNLevel
Default value: 99
If the value is 0, the Knowledge Grid isn't used. Values above 0 enable the Knowledge Grid, but don't seem to do anything else in ICE.

LoaderMainHeapSize
Default value: 320
Size of the memory heap in the loader process, in MB.

LoaderSaveThreadNumber
Default value: 16
Number of parallel load threads. Only used in IEE.

PushDown
Default value: true
Enables or disables Engine Condition Pushdown. Should not be disabled.

ServerCompressedHeapSize
Default value: 256
Size of the compressed memory heap in the server process, in MB.

ServerMainHeapSize
Default value: 600
Size of the uncompressed memory heap in the server process, in MB.

UseCharSet
Default value: false
Used to enable or disable UTF8 support. As this isn't fully implemented in the current version, 3.3.1, it shouldn't be enabled.

AllowMySQLQueryPath
Default value: false
Allows queries to fall back to standard MySQL execution if necessary.
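
To tie these together, here is a sketch of what a brighthouse.ini tweaking a few of the settings above could look like. The values are only examples, not recommendations, and the exact file layout should follow the brighthouse.ini shipped with your installation; any setting left out keeps the default listed above.
ServerMainHeapSize=600
ServerCompressedHeapSize=256
LoaderMainHeapSize=320
CacheFolder=cache
KNFolder=BH_RSI_Repository
ControlMessages=2
PushDown=true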

Wednesday, February 17, 2010

Join algorithms

After I looked through the different sorter algorithms here, I've decided to take a look at the join algorithms. They're a bit more complicated, but it was a very informative process.
First we need to understand the difference between a simple and a complex join.

Simple Join
A simple join is a join which only uses two tables, and isn't a BETWEEN operation.

Complex Join
A complex join is a join which uses more than two tables, or uses a BETWEEN operation.

Now that we know the difference, we can take a closer look at the three different join algorithms.
  • Hash join
  • Sort join
  • General join
Hash join
This algorithm can only be used if it is a simple join and the condition is an equality operation. It uses a temporary table to store the hash of all key values and the matching tuples for the first table.
If enough memory is available, all the hash values are placed in the temporary table; otherwise only the values which fit in memory are placed in the table. Then the values in the second table are iterated to find matching keys.
If all key values from the first table weren't placed in the hash table, the next portion of keys is placed in the table and the second table is iterated again; this is repeated until all keys have been processed.
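As a sketch, a query like the following should qualify for the hash join: it touches exactly two tables and the only join condition is an equality. The table and column names are made up for illustration.
SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id;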

Sort join
This algorithm can only be used if it is a simple join and the condition is <=, <, > or >=.
It works by inserting all keys and dimensions into two sorters, one for each table. Using the Knowledge Grid, irrelevant values are removed, and then the keys are sorted. Both sorters are then traversed in parallel and matched. The final step is to check additional constraints before the rows are committed to the output.
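A hedged example of a join matching these criteria, again with made-up table and column names: two tables joined on a range comparison rather than an equality.
SELECT e.id, p.id FROM events e JOIN periods p ON e.event_date >= p.start_date;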

General join
This is the general join algorithm, and it is used when no other algorithm can be used. It is also the slowest algorithm available: it iterates through all the join dimensions and removes tuples not matching the defined join conditions.
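For illustration, a join like the following is complex by the definition above, because it uses a BETWEEN operation, so it falls back to the general join. Table and column names are made up.
SELECT e.id, p.name FROM events e JOIN periods p ON e.event_date BETWEEN p.start_date AND p.end_date;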

How to determine the algorithm used?
If you enable ControlMessages, you can easily see which algorithm is used. There will be a line like this:
Tuples after inner/outer join noOfdimensions sort/mix/hash/loop noOfTuplesAfterJoin
Sort is the sort join, mix is a mixed algorithm (used if one algorithm fails and the general join is used instead), hash is the hash join and loop is the general join.

ICE importer IGNORE LINES patch

The ICE importer in any of the current versions of ICE, the newest being 3.3.1, doesn't support IGNORE LINES in the LOAD DATA statement. For example, this data file would fail, as we can't tell the importer to skip the first line:
Test data
1;2;3
4;5;6
It is possible to issue the LOAD DATA statement with the IGNORE LINES option, but it is ignored by the loader.
It is very easy to implement in the loader, and the attached patch does exactly this.
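With the patch applied, a statement along these lines (the path and table name are just placeholders) should skip the header line of the data file shown above:
LOAD DATA INFILE '/tmp/testdata.csv' INTO TABLE ibtest FIELDS TERMINATED BY ';' IGNORE 1 LINES;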

The patch is tested in ICE 3.3.1, but should also work in earlier versions. It must be applied in the ICE source code root.

I've created a bug report here.

Patch file:
ignoreLines331.patch (4 KB)

Tuesday, February 16, 2010

ICE MySQL Query Cache patch

Infobright ICE doesn't currently use the MySQL Query Cache for queries executed by the Infobright engine.
The attached patch enables the MySQL Query Cache for Infobright queries also.
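
One way to see whether the patch has an effect is to compare the standard MySQL query cache counters before and after running the same Infobright query twice. These SHOW commands are plain MySQL, not something added by the patch, and the query cache must of course be enabled via query_cache_size and query_cache_type:
SHOW VARIABLES LIKE 'query_cache%';
SHOW STATUS LIKE 'Qcache%';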

There is currently a bug report about this here.

Patch file for ICE 3.3.1; it should also work for earlier versions:
queryCache331.patch (1 KB)

ICE importer default escape character patch

In the current version, 3.3.1, and earlier versions of ICE there is a problem with the importer: it doesn't handle escape characters in strings correctly. This is because the default escape character in the importer is nothing (ASCII code 0). This patch sets the default escape character to '\', which is the default in MySQL.
You'll only run into this problem if you haven't defined an escape character in the LOAD DATA statement and try to load a string containing a ".
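For example, a plain statement like this one (path and table name are illustrative), which gives no ESCAPED BY clause and therefore relies on the default escape character, is the kind of load that previously mishandled strings containing a ":
LOAD DATA INFILE '/tmp/strings.csv' INTO TABLE ibtest FIELDS TERMINATED BY ';' ENCLOSED BY '"';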

Patch file for ICE 3.3.1 (should also work in earlier versions); it has to be applied in the source code root:
escapedCharacter331.patch

ICE RENAME TABLE support

Currently no version of ICE supports the RENAME TABLE statement, but this is about to change.

At least from version 3.3.0, the RENAME TABLE statement has been almost fully implemented; the only thing missing is connecting the rename function to the statement. I have no idea why this hasn't been done a long time ago.

I've created a forum post trying to answer this question; it can be found here.

The patch enables you to give ICE the following statement:
RENAME TABLE oldName TO newName
The attached patch is tested in version 3.3.1, but should also work in 3.3.0, and should be applied in the root ICE source directory.

Patch file:
renameTable331.patch (1 KB)



UPDATE:

If you issue the RENAME TABLE command in the current implementation, the table file structure is corrupted. Therefore, do not try to rename an Infobright table without applying this patch. I've created a bug report here.



UPDATE 2:

I got a response to the forum post I created. The answer to why it wasn't implemented was that renaming tables in ICE isn't officially supported, and therefore this wasn't a bug. See the answer here.

ICE importer FIELDS ESCAPED BY patch

None of the Infobright ICE versions support NULLs escaped with anything other than the default escape character '\'. This is probably because the default way to handle NULLs in the ICE exporter and importer is an empty field. In this example, fields are terminated by ';', and NULLs are indicated in the ICE default way:
1;;2
2;2;2
The problem is that ICE importer also supports NULLs defined the default MySQL way:
1;\N;2
2;2;2
But not with any other escape character, like this:
1;~N;2
2;2;2
In ICE you can define a custom NULL indication by setting the BH_NULL session variable, but for some reason this is not handled in the importer.

There is currently a bug report concerning this here.

I've created a patch to solve this "problem", which means that you can define another escape character like this:
LOAD DATA INFILE '/tmp/outfile.csv' INTO TABLE ibtest FIELDS TERMINATED BY ';' ESCAPED BY '~';

The patch has to be applied in the root of the ICE source code directory.
The patch is tested in ICE 3.3.1, but should also work in any earlier versions.

Patch file:
escapedBy331.patch (2 KB)

Saturday, February 13, 2010

ICE sorter algorithms

In the current version of ICE, 3.3.1, there are four types of sorting algorithms, with the internal names:
  • SorterLimit
  • SorterCounting
  • SorterOnePass
  • SorterMultiPass
Which one is used depends on the Infobright memory settings and the type of sort to be done.

Before going in depth on the different algorithms, we need to understand how Infobright determines the size of the keys to order by. The method is actually pretty smart: from the Knowledge Grid, Infobright can determine the biggest value in a column, and that value is used to determine the key size. For example, a column can be declared as an integer, which is 4 bytes, but if the column only contains values less than 255, the key only has to be 1 byte long; if less than 65535, 2 bytes; and so on.
The size of a row is calculated the same way as the key length, meaning that we get the smallest possible buffer.

The memory used by the Infobright sorter is defined by ServerMainHeapSize in the brighthouse.ini configuration file. The memory allocated to sorting is determined by the following heap size intervals:

ServerMainHeapSize    Memory available for sorting
Less than 0.5 GB      64 MB
0.5 - 1.2 GB          128 MB
1.2 - 2.5 GB          256 MB
2.5 - 5 GB            512 MB
5 - 10 GB             1 GB
More than 10 GB       2 GB

SorterLimit
This algorithm uses a single memory buffer.
The criteria for using this algorithm are as follows:
  • The number of rows retrieved must be less than a third of the total number of rows.
  • The number of rows retrieved must be less than a third of the maximum rows allowed in the memory.
To fulfill these requirements you have to define a LIMIT clause.
This algorithm does the sorting on the fly, meaning that values are sorted as they are loaded into the sort buffer; the other algorithms sort when the values are retrieved. This means that a much smaller buffer is used, but because the sorting occurs on the fly, it takes longer to sort many rows. Therefore it makes sense that this algorithm is chosen only for small datasets.
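As a sketch, a query like the following (table and column names are made up) is the kind of query that can use SorterLimit, provided the LIMIT keeps the number of retrieved rows below the thresholds above:
SELECT col1 FROM ibtest ORDER BY col1 LIMIT 10;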

SorterCounting
This is a Counting Sort algorithm, which uses two memory buffers.
The criteria for using this algorithm are as follows:
  • The memory must be able to hold twice as many rows as the table contains.
  • The key size must be less than 3 bytes long.
  • If the key is 1 byte long, the total number of rows must be above 1024. And if it's 2 bytes long, the number of rows must be above 256000.
As the Counting Sort algorithm is impractical for large ranges, it is only used for low-cardinality keys.

SorterOnePass
This algorithm uses either a Bubble Sort, a Quick Sort or a combination of both, and uses a single memory buffer.
The criteria for using this algorithm are as follows:
  • The memory must be able to hold all the rows in the table.
If we are ordering fewer than 20 values, Bubble Sort is used. If we are ordering more than 20 values, most of the values are sorted using Quick Sort, but depending on the distribution of values in the table, smaller intervals are sorted using Bubble Sort.

SorterMultiPass
This uses the same sorting algorithm as SorterOnePass, but uses multiple buffers.
The criteria for using this algorithm are as follows:
  • The memory cannot hold all the rows in the table.
The number of buffers used is determined by the available memory; each buffer has the size of the available memory, so if three times the available memory is needed, three buffers are created. When a buffer has been filled, it is sorted and saved to disk. When a value is retrieved, the current buffer is sorted, and then the values, both from disk and memory, are returned.
Because of the multiple buffers and sort passes, this is the most costly sorting algorithm.

How to determine which algorithm is used?
It is pretty easy to determine which algorithm is going to be used: just look at the key length, the row size, the number of rows in the table, the number of rows to get and the available memory. If ControlMessages has been enabled in the brighthouse.ini configuration file, you can also look at the contents of the bh.err log file and look for the Sorter initialized line to get the number of rows, the key size and the total row size. The line could look like this:
2010-02-12 20:05:33 [1] Sorter initialized for 28 rows, 1+8 bytes each.
This means that the table contains 28 rows, the key is 1 byte long and the total row size is 1+8 bytes. With this information in hand, look at the criteria above to figure out which algorithm is chosen.

Friday, February 12, 2010

ICE 3.3.1 released follow up - The biggest little-known secret

Before updating my Infobright patches, I decided to take a closer look at what actually changed from 3.3.0 to 3.3.1, and to my surprise, a very big new feature has been implemented, namely UTF-8 support.

Why would they implement something like this, even partially, without telling the public?
A tool called CharSetMigrationTool has even been included in the source distribution; I don't know about the binary distributions. The tool seems to update the Knowledge Grid to support UTF-8.

Just by looking at the source code, it seems like UTF-8 support has been fully implemented with a few exceptions, like CMAPs for UTF-8. CMAPs are explained in more detail in this forum post.

On the ICE forum I posted the question of whether it has been implemented or not; the post can be found here.

Other than that, no unannounced changes seem to be hiding in the 3.3.1 release.

UPDATE: One of the Infobright developers replied to my forum post, and he confirms that UTF-8 support has been partially implemented in ICE 3.3.1. The next release of ICE should contain UTF-8 support for the ICE loader process. See the reply here.

Thursday, February 4, 2010

ICE 3.3.1 released

Today Infobright 3.3.1 has been released.
The most important change is that MySQL has been upgraded from 5.1.14 to 5.1.40 GA, which fixes a bunch of different bugs in the MySQL engine, covering performance, query syntax, etc.
Because of the MySQL upgrade it is necessary to run the MySQL Updater program, as the data file structure has been changed in MySQL 5.1.40 GA. Upgrade instructions are available here:
Windows Upgrade Instructions
Linux Upgrade Instructions
Tar Upgrade Instructions
There is also a known issue with the 3.3.1 version.
Please note: There is a known issue with 3.3.1. A server crash may occur for queries of the form:
SELECT distinct(a1), max(a2) FROM t1 GROUP BY a3;
Where a1, a2, a3 may or may not be the same. It is caused by the use of max and min and the distinct clause, and may be easily worked around by removing the distinct and adding an additional clause to the GROUP BY.
This issue has been resolved and will be included in the next release. A full list of known and fixed issues is included at the end of this document.
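As a hedged sketch of the suggested workaround, the crashing query above could be rewritten by dropping the DISTINCT and adding the selected column to the GROUP BY, for example:
SELECT a1, max(a2) FROM t1 GROUP BY a3, a1;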
All the information about the new release can be found here and can be downloaded here.
The patches previously posted on this blog will be updated to work with 3.3.1. The updated patches will be added to the old posts.

Wednesday, February 3, 2010

ICE importer LINES TERMINATED BY patch

As a continuation of my previous post: Infobright also doesn't allow importing with a user-defined line terminator. There is currently a bug report here.
This patch fixes that. Please note that the patch from this post, http://infobright.blogspot.com/2010/02/infobright-ice-exporter-lines.html, has to be applied first.

This patch allows you to use the Infobright importer while defining a line terminator, like this:
LOAD DATA INFILE '/tmp/outfile.csv' INTO TABLE ibtest LINES TERMINATED BY '\t'
By default the Infobright importer supports \r\n and \n as line terminators and figures out which one is used automatically, but if another line terminator is used, importing the file will fail.

The patch has to be applied in the Infobright source root.

Patch file:
importLinesTerminated.patch (4 KB)

UPDATE:
Patch file for ICE 3.3.1:
importLinesTerminatedBy331.patch (4 KB)

ICE exporter LINES TERMINATED BY patch

In all current versions of Infobright, the newest being 3.3.0, you cannot define LINES TERMINATED BY when exporting data to a file. Well, you can define it, but it isn't used when the Infobright exporter is used.
The default line terminator is \r\n, which works great, but sometimes you want to use another line terminator, which is not supported in any of the current versions of Infobright.

There are currently two bug reports concerning this, http://bugs.infobright.org/ticket/1235 and http://bugs.infobright.org/ticket/1072.

This patch fixes this; it works in Infobright 3.3.0 and allows you to define LINES TERMINATED BY like this:
SELECT * FROM ibtest INTO OUTFILE '/tmp/outfile.csv' LINES TERMINATED BY '\n'
The patch has to be applied at the root of the Infobright source.

Patch file:
exporterLinesTerminated.patch (8 KB)

UPDATE:
Patch file for ICE 3.3.1:
exporterLinesTerminated331.patch (8 KB)