Friday, October 15, 2010

ICE 3.5 Beta released

ICE 3.5 beta was released yesterday. The beta includes enhanced memory management, durability improvements and improved performance on lookup columns.
A lot of minor bugs have also been fixed, especially issues which caused server crashes.
Unfortunately the source hasn't been released, so I can't take a more technical look at the changes, but the changes described in the release notes seem to be an overall improvement.
I haven't had time to test it yet, but as soon as I do I'll post a comparison between the 3.5 beta and 3.4.2.

Wednesday, September 8, 2010

Percona compares Infobright and InfiniDB

I've just noticed that Percona has compared ICE against InfiniDB. Percona is, in my opinion, one of the best experts in MySQL and the available storage engines, and is very active in the MySQL community.
They have made a 22-page comparison of the two storage engines, where they looked at:

  • DDL and datatype support
  • Time to setup database and data loading
  • Size of the loaded data and compression
  • Ease of installation and security
  • Queries over large datasets
They used the ICE 3.3.2 beta and InfiniDB 1.5 GA.
Here is a very short summary of their conclusions, with focus on Infobright.

DDL and datatype support
Infobright supports a number of MySQL data types and attributes which are not supported by InfiniDB, for example:
  • Year
  • Time
  • Tinytext
  • NOT NULL
When one of these data types was necessary, they used another data type which could hold the data instead.
Neither database supplied very good error messages when unsupported data types were used.

Loading Data
Even though ICE supports the MySQL "LOAD DATA INFILE" syntax, it is not compatible with the default MySQL settings, and as those who have tried ICE know, this can be very confusing in the beginning. But compared to InfiniDB, loading data into ICE was much easier.
InfiniDB was about 147% faster at loading a 900 GB dataset, and was on average 346 seconds faster per file.

Compression
InfiniDB doesn't compress data, and therefore its data size is much bigger than ICE's. The data size in ICE was about 13% of the source data size.

Installation and security
Again ICE scored much better than InfiniDB both in security (5/5) and ease of installation and use (4/5).

Queries
This is the most interesting part. Many queries (15 of 29) could not be executed on InfiniDB, as not all aggregation functions are supported.
ICE was faster than InfiniDB in 26 out of 29 queries, but as stated above, not all queries could be executed by InfiniDB.

Conclusion
Percona concludes that ICE performs better, is easier to use and gives more accurate results than InfiniDB.

The full comparison can be downloaded at the Infobright Whitepaper Resource Library.

Sunday, March 14, 2010

Infobright ICE 3.3.2 Beta released

Infobright ICE 3.3.2 Beta has been released. It adds full support for storing and querying UTF-8 data. The release notes can be found here, and ICE 3.3.2 Beta can be downloaded here.
As this is a beta it should not be used in production systems yet, but I can only recommend trying it out if you need UTF-8 support.
As I did last time, I took a look at the differences between the two latest versions, 3.3.1 and 3.3.2 Beta.
Most of the changes are UTF-8 related, but there are a couple of other interesting changes which aren't fully implemented in the beta.

It looks like a new configuration parameter is about to be added: UseMySQLImportExportDefaults. It is not implemented yet, meaning it won't do anything if enabled in the configuration file. But it sounds like it will make ICE use the default MySQL import and export settings instead of the current Infobright settings.

It also looks like a lot of other export/import improvements are under way, for example:
  • Ignore lines
  • Load data local infile
  • Value list elements
  • Lock options
  • Optionally enclosed
  • Lines starting by
Some of these options may never be implemented in Infobright, but at least you now get a warning if the options are used, saying that they aren't supported and will be ignored. The fact that they now appear in the source code is a step in the right direction.

There have also been some improvements to sorting and to the caching mechanisms.

Monday, March 1, 2010

ICE LOAD DATA LOCAL INFILE

Currently no version of ICE supports the LOAD DATA LOCAL INFILE statement, although some ETL tools provide a workaround (see http://www.infobright.org/Forums/viewthread/1123/) by using named pipes and a program installed on the server.
Those who don't use an ETL tool cannot do anything other than copy the file to the server and load it from there.
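A minimal sketch of the named-pipe idea (the paths, table name and ssh transport here are hypothetical, just one way to wire it up):
-- On the server shell: mkfifo /tmp/ibpipe
-- On the client shell: cat local.csv | ssh user@server 'cat > /tmp/ibpipe' &
-- Then, in MySQL on the server, read the pipe as if it were a regular file:
LOAD DATA INFILE '/tmp/ibpipe' INTO TABLE ibtest FIELDS TERMINATED BY ';';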
I've created a patch which enables remote data loading by using named pipes, but currently it only works on Linux.
When I have more time I'll make it work on Windows too.

The patch has to be applied in the root of the Infobright source code, and is tested with ICE 3.3.1 on Linux. It may work on other UNIX-like systems, but it won't work on Windows.

Update:
I have discovered a bug in the original patch. The bug allows the data loading to end prematurely, meaning that not all rows are imported. I've created a temporary fix until I figure out why this happens.
The fix can be downloaded here.

Patch file:
loadDataLocal331.patch (9 KB)

Thursday, February 25, 2010

ICE ini settings

The brighthouse.ini file contains all Infobright-specific settings. But what settings are actually available, and what do they do? When I looked at the source code, I saw settings I had no idea existed.
The list below applies to ICE 3.3.1.
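To illustrate, a brighthouse.ini could contain lines like these (assuming the usual name = value format; the values shown are just the defaults described below):
ServerMainHeapSize = 600
ServerCompressedHeapSize = 256
LoaderMainHeapSize = 320
ControlMessages = 0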

BufferingLevel
Default value: 2
Used when the loader heap can't hold all the data; defines the size of the buffer. The size can be calculated as LoaderMainHeapSize / BufferingLevel, so with the default values the buffer is 320 MB / 2 = 160 MB.

CacheFolder
Default value: "cache"
As the name implies this defines the location of the Infobright cache folder.

CachingLevel
Default value: 1
Defines the type of caching used on the uncompressed heap. If the value is 1 or 2, Infobright removes unused data packs from the heap when it runs out of space; this is not done for any other value. If that isn't enough, or another value is defined, it tries to release memory and to compress the data and move it to the compressed heap.

LicenseFile
Default value: ""
This points to the Infobright license file, and is only used in the IEE evaluation edition.

ClusterSize
Default value: 2000
The maximum size of a data file, in MB.

ControlMessages
Default value: 0
Enables control messages, which are usually needed when investigating performance. The value 2 enables them; values above 2 enable them and also produce timestamps.

HugeFileDir
Default value: ""
Only used on non-Windows machines. Specifies the location of a memory-mapped file containing the server main heap. If the setting isn't defined, no memory-mapped file is used; the heap is instead placed directly in memory.

InternalMessages
Default value: false
Used to print FET - Function Execution Times - to a log file called development.log. Only used if compiled with --with-FET.

KNFolder
Default value: BH_RSI_Repository
The folder containing the Knowledge Grid data.

KNLevel
Default value: 99
If the value is 0, the Knowledge Grid isn't used. Values above 0 enable the Knowledge Grid, but don't seem to do anything else in ICE.

LoaderMainHeapSize
Default value: 320
Size of the memory heap in the loader process, in MB.

LoaderSaveThreadNumber
Default value: 16
Number of parallel load threads. Only used in IEE.

PushDown
Default value: true
Enables or disables Engine Condition Pushdown. Should not be disabled.

ServerCompressedHeapSize
Default value: 256
Size of the compressed memory heap in the server process, in MB.

ServerMainHeapSize
Default value: 600
Size of the uncompressed memory heap in the server process, in MB.

UseCharSet
Default value: false
Enables or disables UTF-8 support. As this isn't fully implemented in the current version, 3.3.1, it shouldn't be enabled.

AllowMySQLQueryPath
Default value: false
Allows queries to fall back to standard MySQL execution if necessary.

Wednesday, February 17, 2010

Join algorithms

After looking through the different sorter algorithms here, I've decided to take a look at the join algorithms. They're a bit more complicated, but it was a very informative process.
First we need to understand the difference between a simple and a complex join.

Simple Join
A simple join is a join which uses only two tables and isn't a BETWEEN operation.

Complex Join
A complex join is a join which uses more than two tables, or a BETWEEN operation.

Now that we know the difference, we can take a closer look at the 3 different join algorithms.
  • Hash join
  • Sort join
  • General join
Hash join
This algorithm can only be used if it is a simple join and the condition is an equality operation. It uses a temporary table to store the hash of all key values and the matching tuples from the first table.
If enough memory is available, all the hash values are placed in the temporary table; otherwise only the values which fit in memory are placed there. Then the values in the second table are iterated to find matching keys.
If all key values from the first table weren't placed in the hash table, the next portion of keys is placed in the table and the second table is iterated again; this is repeated until all keys have been processed.
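For example, a query like this (with hypothetical tables) is a simple join with an equality condition, so it should qualify for the hash join:
SELECT t1.id, t2.value FROM t1 JOIN t2 ON t1.id = t2.id;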

Sort join
This algorithm can only be used if it is a simple join and the condition is <=, <, > or >=.
It works by inserting all keys and dimensions into two sorters, one for each table. Using the Knowledge Grid, irrelevant values are removed, and then the keys are sorted. Both sorters are then traversed in parallel and matched. The final step is to check additional constraints before the rows are committed to the output.
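A query shape which should qualify for the sort join instead, as the condition is a range comparison (again with hypothetical tables):
SELECT t1.id, t2.id FROM t1 JOIN t2 ON t1.id < t2.id;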

General join
This is the general join algorithm, used when no other algorithm can be used. It is also the slowest algorithm available; it iterates through all the join dimensions and removes tuples not matching the defined join conditions.
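For example, a BETWEEN join is complex by the definition above, so it should end up in the general join (hypothetical tables again):
SELECT t1.id FROM t1 JOIN t2 ON t1.id BETWEEN t2.low AND t2.high;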

How to determine the algorithm used?
If you enable ControlMessages, you can easily see which algorithm is used. There will be a line like this:
Tuples after inner/outer join noOfdimensions sort/mix/hash/loop noOfTuplesAfterJoin
Sort is the sort join, mix is a mixed algorithm (if one algorithm fails and the general join is used instead), hash is the hash join and loop is the general join.

ICE importer IGNORE LINES patch

The ICE importer, in any of the current versions of ICE (the newest being 3.3.1), doesn't support IGNORE LINES in the LOAD DATA statement. For example, this data file would fail, as we can't tell the loader to skip the first line:
Test data
1;2;3
4;5;6
It is possible to issue the LOAD DATA statement with the IGNORE LINES option, but it is ignored by the loader.
It is very easy to implement in the loader, and the attached patch does exactly this.
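With the patch applied, a statement like this (with hypothetical file and table names) skips the header line:
LOAD DATA INFILE '/tmp/data.csv' INTO TABLE ibtest FIELDS TERMINATED BY ';' IGNORE 1 LINES;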

The patch is tested in ICE 3.3.1, but should also work in earlier versions. It must be applied in the ICE source code root.

I've created a bug report here.

Patch file:
ignoreLines331.patch (4 KB)

Tuesday, February 16, 2010

ICE MySQL Query Cache patch

Infobright ICE doesn't currently use the MySQL Query Cache for queries executed by the Infobright engine.
The attached patch enables the MySQL Query Cache for Infobright queries as well.

There is currently a bug report about this here.

Patch file for ICE 3.3.1, should also work for earlier versions:
queryCache331.patch (1 KB)

ICE importer default escape character patch

In the current version of ICE, 3.3.1, and earlier versions there is a problem with the importer: it doesn't handle escape characters in strings correctly. This is because the default escape character in the importer is nothing (ASCII code: 0). This patch sets the default escape character to '\', which is the default in MySQL.
You'll only run into this problem if you haven't defined an escape character in the LOAD DATA statement and try to load a string containing a ".
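Until the patch is applied, explicitly defining the escape character in the statement avoids the problem; a hypothetical example:
LOAD DATA INFILE '/tmp/data.csv' INTO TABLE ibtest FIELDS TERMINATED BY ';' ESCAPED BY '\\';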

Patch file for ICE 3.3.1 (should also work in earlier versions), has to be applied in the source code root:
escapedCharacter331.patch

ICE RENAME TABLE support

Currently no version of ICE supports the RENAME TABLE statement, but this is about to change.

At least from version 3.3.0, the RENAME TABLE statement has been almost fully implemented; the only thing missing is connecting the rename function to the statement. I have no idea why this wasn't done a long time ago.

I've created a forum post trying to get this question answered; it can be found here.

The patch enables you to give ICE the following statement:
RENAME TABLE oldName TO newName
The attached patch is tested in version 3.3.1, but should also work in 3.3.0, and should be applied in the root ICE source directory.

Patch file:
renameTable331.patch (1 KB)



UPDATE:

If you issue the RENAME TABLE command in the current implementation, the table file structure is corrupted. Therefore, do not try to rename an Infobright table without applying this patch. I've created a bug report here.



UPDATE 2:

I got a response to the forum post I created. The answer to why it wasn't implemented is that renaming tables in ICE isn't officially supported, and therefore this isn't a bug. See the answer here.

ICE importer FIELDS ESCAPED BY patch

None of the Infobright ICE versions support NULLs escaped with anything other than the default escape character '\'. This is probably because the default way to handle NULLs in the ICE exporter and importer is an empty field. In this example, fields are terminated by ';' and NULLs are indicated the ICE default way:
1;;2
2;2;2
The problem is that the ICE importer also supports NULLs defined the default MySQL way:
1;\N;2
2;2;2
But not with any other escape character, like this:
1;~N;2
2;2;2
In ICE you can define a custom NULL indication by setting the BH_NULL session variable, but for some reason this is not handled in the importer.

There is currently a bug report concerning this here.

I've created a patch to solve this "problem", which means that you can define another escape character like this:
LOAD DATA INFILE '/tmp/outfile.csv' INTO TABLE ibtest FIELDS TERMINATED BY '~';

The patch has to be applied in the root of the ICE source code directory.
The patch is tested in ICE 3.3.1, but should also work in any earlier versions.

Patch file:
escapedBy331.patch (2 KB)

Saturday, February 13, 2010

ICE sorter algorithms

In the current version of ICE, 3.3.1, there are four sorting algorithms, with the internal names:
  • SorterLimit
  • SorterCounting
  • SorterOnePass
  • SorterMultiPass
Which one is used depends on the Infobright memory settings and the type of sort to be done.

Before going in depth on the different algorithms, we need to understand how Infobright determines the size of the keys to order by. The method is actually pretty smart: from the Knowledge Grid, Infobright can determine the biggest value in a column, and that value is used to determine the key size. For example, a column can be declared as an integer, which is 4 bytes, but if the column only contains values less than 256, the key only has to be 1 byte long; for values less than 65536, 2 bytes; and so on.
The size of a row is calculated the same way as the key length, meaning that we get the smallest possible buffer.

The memory used by the Infobright sorter is defined by ServerMainHeapSize in the brighthouse.ini configuration file. The memory allocated to sorting is determined by the following heap size intervals:

ServerMainHeapSize      Memory available for sorting
Less than 0.5 GB        64 MB
0.5 - 1.2 GB            128 MB
1.2 - 2.5 GB            256 MB
2.5 - 5 GB              512 MB
5 - 10 GB               1 GB
More than 10 GB         2 GB

SorterLimit
This algorithm uses a single memory buffer.
The criteria for using this algorithm are as follows:
  • The number of rows retrieved must be less than a third of the total number of rows.
  • The number of rows retrieved must be less than a third of the maximum rows allowed in the memory.
To fulfill these requirements you have to define a LIMIT clause.
This algorithm does the sorting on the fly, meaning that values are sorted as soon as they are loaded into the sort buffer; the other algorithms sort when the values are retrieved. This means that a much smaller buffer is used, but because the sorting occurs on the fly, it takes longer to sort many rows. Therefore it makes sense that this algorithm is chosen only for small datasets.
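For example, a query like this (hypothetical table) retrieves only a few rows and should be a candidate for SorterLimit, provided the criteria above are met:
SELECT * FROM ibtest ORDER BY id LIMIT 10;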

SorterCounting
This is a Counting Sort algorithm, which uses two memory buffers.
The criteria for using this algorithm are as follows:
  • The memory must be able to hold twice as many rows as the table contains.
  • The key size must be less than 3 bytes long.
  • If the key is 1 byte long, the total number of rows must be above 1024; if it's 2 bytes long, the number of rows must be above 256000.
As the Counting Sort algorithm is impractical for large ranges, it is only used for low-cardinality keys.

SorterOnePass
This algorithm uses either a Bubble Sort, a Quick Sort, or a combination of both, and uses a single memory buffer.
The criteria for using this algorithm are as follows:
  • The memory must be able to hold all the rows in the table.
If we are ordering fewer than 20 values, Bubble Sort is used. If we are ordering more than 20 values, most of the values are sorted using Quick Sort, but depending on the distribution of values in the table, smaller intervals are sorted using Bubble Sort.

SorterMultiPass
This uses the same sorting algorithms as SorterOnePass, but with multiple buffers.
The criteria for using this algorithm are as follows:
  • The memory cannot hold all the rows in the table.
The number of buffers used is determined by the available memory; each buffer has the size of the available memory, so if three times the memory is needed, three buffers are created. When a buffer has been filled, it is sorted and saved to disk. When a value is retrieved, the current buffer is sorted, and then the values, from both disk and memory, are returned.
Because of the multiple buffers and sort passes, this is the most costly sorting algorithm.

How to determine which algorithm is used?
It is pretty easy to determine which algorithm is going to be used: just look at the key length, the row size, the number of rows in the table, the number of rows to get and the available memory. If ControlMessages has been enabled in the brighthouse.ini configuration file, you can also look at the contents of the bh.err log file for the Sorter initialized line, to get the number of rows, the key size and the total row size. The line could look like this:
2010-02-12 20:05:33 [1] Sorter initialized for 28 rows, 1+8 bytes each.
This means that the table contains 28 rows, the key is 1 byte long and the total row size is 1+8 bytes. With this information in hand, look at the criteria above to figure out which algorithm is chosen.

Friday, February 12, 2010

ICE 3.3.1 released follow up - The biggest little-known secret

Before updating my Infobright patches, I decided to take a closer look at what actually changed from 3.3.0 to 3.3.1, and to my surprise, a very big new feature has been implemented, namely UTF-8 support.

Why would they implement something like this, even partially, without telling the public?
A tool called CharSetMigrationTool has even been included in the source distribution (I don't know about the binary distributions). The tool seems to update the Knowledge Grid to support UTF-8.

Just by looking at the source code, it seems like UTF-8 support has been fully implemented with a few exceptions, like CMAPs for UTF-8. CMAPs are explained in more detail in this forum post.

On the ICE forum I posted the question of whether it has been implemented or not; the post can be found here.

Other than that, no unannounced changes seem to be hiding in the 3.3.1 release.

UPDATE: One of the Infobright developers replied to my forum post, and he confirms that UTF-8 support has been partially implemented in ICE 3.3.1. The next release of ICE should contain UTF-8 support for the ICE loader process. See the reply here.

Thursday, February 4, 2010

ICE 3.3.1 released

Today Infobright 3.3.1 has been released.
The most important change is that MySQL has been upgraded from 5.1.14 to 5.1.40 GA, which fixes a bunch of different bugs (performance, query syntax, etc.) in the MySQL engine.
Because of the MySQL upgrade it is necessary to run the MySQL Updater program, as the data file structure has been changed in MySQL 5.1.40 GA. Upgrade instructions are available here:
Windows Upgrade Instructions
Linux Upgrade Instructions
Tar Upgrade Instructions
There is also a known issue with the 3.3.1 version:
Please note: There is a known issue with 3.3.1. A server crash may occur for queries of the form:
SELECT distinct(a1), max(a2) FROM t1 GROUP BY a3;
Where a1, a2, a3 may or may not be the same. It is caused by the use of max and min and the distinct clause, and may be easily worked around by removing the distinct and adding an additional clause to the GROUP BY.
This issue has been resolved and will be included in the next release. A full list of known and fixed issues is included at the end of this document.
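Applying that suggestion to the example above, the workaround could look like this (my interpretation, not taken from the release notes):
SELECT a1, max(a2) FROM t1 GROUP BY a3, a1;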
All the information about the new release can be found here, and it can be downloaded here.
The patches previously posted on this blog will be updated to work with 3.3.1. The updated patches will be added to the old posts.

Wednesday, February 3, 2010

ICE importer LINES TERMINATED BY patch

As a continuation of my previous post: Infobright also doesn't allow importing with a user-defined line terminator. There is currently a bug report here.
This patch fixes this; please note that the patch from this post, http://infobright.blogspot.com/2010/02/infobright-ice-exporter-lines.html, has to be applied first.

This patch allows you to use the Infobright importer while defining a line terminator, like this:
LOAD DATA INFILE '/tmp/outfile.csv' INTO TABLE ibtest LINES TERMINATED BY '\t'
By default the Infobright importer supports \r\n and \n as line terminators, and figures out which automatically, but if another line terminator is used, importing the file will fail.

The patch has to be applied in the Infobright source root.

Patch file:
importLinesTerminated.patch (4 KB)

UPDATE:
Patch file for ICE 3.3.1:
importLinesTerminatedBy331.patch (4 KB)

ICE exporter LINES TERMINATED BY patch

In all current versions of Infobright, the newest being 3.3.0, you cannot define LINES TERMINATED BY when exporting data to a file. Well, you can define it, but it isn't used when the Infobright exporter is used.
The default line terminator is \r\n, which works great, but sometimes you want to use another line terminator, which is not supported in any of the current versions of Infobright.

There are currently two bug reports concerning this, http://bugs.infobright.org/ticket/1235 and http://bugs.infobright.org/ticket/1072.

This patch fixes this; it works in Infobright 3.3.0 and allows you to define LINES TERMINATED BY like this:
SELECT * FROM ibtest INTO OUTFILE '/tmp/outfile.csv' LINES TERMINATED BY '\n'
The patch has to be applied at the root of the Infobright source.

Patch file:
exporterLinesTerminated.patch (8 KB)

UPDATE:
Patch file for ICE 3.3.1:
exporterLinesTerminated331.patch (8 KB)

Monday, January 25, 2010

Default ICE NULL string patch

One thing about Infobright that annoys me a lot is the exporter's default way of indicating NULL values, which is nothing: NULL is indicated by no character at all. This poses a readability problem when fields are terminated by, for example, tabs; then you cannot easily see the NULL values.

To get around this problem you can use the session variable BH_NULL and set it to, for example, the default MySQL indicator, \N. But this has to be done for every session/connection.
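For example (BH_NULL itself is documented; the exact statement below is how I would expect to set it, so treat it as a sketch):
SET @bh_null = '\\N';
SELECT * FROM ibtest INTO OUTFILE '/tmp/outfile.csv';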

As I did not want to do this, I created a trivial patch which sets the default NULL string to \N, as in standard MySQL.

The patch can be downloaded below, has to be applied in the root of the Infobright source files, and works with Infobright 3.3.0.

Patch file:
DefaultNullStr.patch (1 KB)

UPDATE: The patch also works for ICE 3.3.1

Sunday, January 24, 2010

Default ICE exporter patch

When you export files in the current Infobright version, 3.3.0, and earlier, there are some things you should be aware of. The default exporter is the standard MySQL exporter, but this exporter is not compatible with the default Infobright importer.

For example, if you do this, you may run into problems because of the differences between the exporter and importer:
SELECT * FROM infobrightTable INTO OUTFILE '/tmp/outfile.txt';
LOAD DATA INFILE '/tmp/outfile.txt' INTO TABLE infobrightTable;
Currently there is a bug report about this here. Unfortunately, you'll have to create a user to access the contents.

As described in the bug report, and in other places, you'll have to set the BH_DATAFORMAT session variable to txt_variable to use the Infobright exporter; otherwise you may run into format compatibility problems.
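In practice it looks like this (the variable name comes from the bug report; the statement shape is my usual way of setting it):
SET @bh_dataformat = 'txt_variable';
SELECT * FROM infobrightTable INTO OUTFILE '/tmp/outfile.txt';
LOAD DATA INFILE '/tmp/outfile.txt' INTO TABLE infobrightTable;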

The 'default' exporter will only be used if you are accessing Infobright tables, so in my opinion the default exporter should be the Infobright exporter, as defined by BH_DATAFORMAT=txt_variable, but at the moment it isn't.

To work around this I've created a very simple patch which does exactly this: when the BH_DATAFORMAT variable isn't defined, it uses the Infobright exporter instead of the default MySQL exporter.

The patch file can be downloaded at the bottom of this post and works for the current version, 3.3.0. It has to be applied to the file src/storage/brighthouse/core/RCEngine_tools.cpp directly.

Patch file for 3.3.0:
DefaultTxtVariable.patch (1 KB)

UPDATE:
Patch file for 3.3.1:
DefaultTxtVariable331.patch (1 KB)

MySQL and Infobright

At my workplace we have used the Infobright Community Edition, http://www.infobright.org/, for a while now, with great success.

For those who don't know what Infobright is, I recommend taking a look at this page: What is ICE? ICE stands for Infobright Community Edition, which is the open source version.

One of Infobright's great strengths is compression, which is top-notch. And for certain analytic queries it performs very, very well.

But there is certainly room for improvement. For example, as Infobright uses its own optimizer rather than the default MySQL optimizer, certain queries aren't optimized very well yet, and some aren't supported at all. Unsupported queries fall back to the standard MySQL optimizer, which doesn't work well with Infobright tables.

Also, they are very slow to implement contributed patches. Because of this, I've decided to post patches on this blog: patches which are committed to their bug tracker, but also patches developed specifically for use at my company.