Insert Data at High Speed
Table of Contents
This chapter presents several guidelines on how to insert data at a high speed. In the case of the time series data, the volume of generated per second is at least thousands of data up to hundreds of thousands of data. Therefore, the load of the system has to be optimized to achieve the desired performance. They are divided into optimized utilization of the CPU, the optimization of I/O performance, and the optimization of the index thread. The value of the property which are provided by Machbase needs to be modified for the system.
When a large amount of data is continuously inserted, it is stored into the memory, and then I/O thread performs the work in the background. When I/O is completed, memory for data loading is released. However, the speed of disk I/O is 100 times or even 1000 times slower compared to the memory. As a result, when loading data at a high-speed, the speed of accumulating data in the memory is a lot faster than writing data on the disk. Thus, the assigned memory can be increased exponentially as it is not able to transfer data from the system to the disk. For checking memory increase, the system utility commands such as "top" or "htop", or V$SYSMEM provided in Machbase can be used.
SELECT SUM(USAGE) FROM V$SYSMEM;
To know the sum of the memory that machbased is currently using, run the query above.
To solve problem mentioned above, make the speed of disk I/O fast. For this, two options are available.
- Use a fast disk
- Use multiple disks
The first solution is very simple: apply a high-performance disk such SSD (Solid State Disk). Even though its price is 10 or 20 times higher than SATA (Serial ATA) HDD, it is massive when it comes to capacity. The second solution is to apply multiple SATA disks. It is relatively fast and cheaper than SSD and is suitable for processing big data. In Machbase, to use multi-disk function, you have to install the normal SATA disk in the system and generate a tablespace composed of multiple disks. When you create a table, specify the created tablespace. The data file of the created table in this tablespace be distributed and stored on disks of tablespace as round-robin scheduling to perform I/O and let the data be written to the disk.
Suppose there are five disks installed in the following paths.
And each disk generates directories where data are stored. See below.
It creates a tablespace by using the following syntax. In the example below, PARALLEL_IO of each disk is the maximum number of I/O. If you don't specify PARALLEL_IO, the default value is used. The default value is the value of existing DISK_IO_THREAD_COUNT in machbase.conf.
CREATE TABLESPACE tbs1 DATADISK disk1 (DISK_PATH = "/media/disk1/machdata", PARALLEL_IO = 8), disk2 (DISK_PATH = "/media/disk2/machdata", PARALLEL_IO = 8), disk3 (DISK_PATH = "/media/disk3/machdata", PARALLEL_IO = 8), disk4 (DISK_PATH = "/media/disk4/machdata", PARALLEL_IO = 8), disk5 (DISK_PATH = "/media/disk5/machdata", PARALLEL_IO = 8);
And it creates a table for the tablespace created above.
CREATE TABLE T1 (c1 INTEGER, c2 INTEGER) TABLESPACE tbs1;
After conducting the syntax above, load the data to the table, data are distributed and stored on a specified disk in the tablespace.
This can only be used with a full license. Without the license, you can use only one disk.
In general, buffered I/O is used as a means of the disk I/O. When a user requests to write to the OS Kernel, it records it on the page of buffer cache area was allocated by the Kernel, and informs the user that it has been written. Actually, disk write doesn't occur in reality. Periodically, the Kernel transfers data on disks in the background asynchronously. If read has a requested data in the buffer cache of OS Kernel, you can fetch the requested data right away without disk I/O. As a result, the performance of the system can be improved through buffered I/O. However, it is not suitable for the applications that process a large amount of data like a database. You have to manage the extra buffer within the applications while considering the characteristics of data and user's data access pattern. In the case of buffered I/O, it causes a problem of using memory twice for caching the same data of Kernel and application.
There are many cases that a large amount of data comes into Machbase constantly and most of access patterns is to extract statistics. Therefore, if the same method of buffer management for conventional DBMSs, data cannot be transferred to the disk quickly enough, and some delay was caused for securing memory spaces for new data. It degrades the performance of system loading as a whole. In order to prevent this issue, Machbase transfers the data on disks as soon as data arrives, and releases the allocated memory for new data. In this regard, Machbase provides direct I/O and it writes from user memory to disk directly unlike buffered I/O. Therefore, it doesn't need to be copied in the Kernel buffer area. It reduces the overhead in the Kernel and the speed goes up. To use this function, set the property in machbase.conf to 1 prior to start Machbase. Then, start up Machbase.
DISK_TABLESPACE_DIRECT_IO = 1
In addition, there is DISK_TABLESPACE_DIRECT_IO_FSYNC. To improve performance by allowing a certain degree of data loss when power failure occurs, set DISK_TABLESPACE_DIRECT_IO_FSYNC as follows.
DISK_TABLESPACE_DIRECT_IO_FSYNC = 0