Netezza now “IBM PureData® System for Analytics” is a data warehouse appliance for BI and analytics. It is easy to use, scalable and offers high performance while handling transactional data ranging in terabytes. One of the biggest reasons for the high performance of Netezza appliance is the deep integration of hardware, software, and storage, which simplifies and accelerates deployment, delivering commendable business analytics value. As an integrated analytics system, Netezza helps drive the insights required to maintain business competitiveness. Netezza is also an ideal solution for data scientists as it matches their need for accelerated development and deployment. Besides being an optimal option for database management, Netezza deals with a severe space-constraint that hinders its functionality. This blog post covers this shortcoming of Netezza and gives a resolution for it.
Figure 1: General architecture of Netezza
Before we understand Netezza’s shortcomings on disk space, we first need to understand its architecture. In Netezza appliance there are two host for high data availability and they are connected to a number of S-blades, all having a CPU, FPGA, and Memory. The CPU, FPGA, and Memory together form SPU (Snippet Processing Unit).
When we create any table in Netezza the table header is stored on all the SPU and no storage is allocated on disk. Once we start inserting data into table, then according to the distribution key and hashing algorithm, it will go to either some SPU or all the SPUs. If an SPU contains data, one 3MB extent is allocated to the table, irrespective of the data size (less than 3 MB). This 3MB space is then divided into 24 x 128 KB pages (also called a block) in which the data is stored. If the data in the SPU exceeds 3MB then additional 3MB extent is added to the SPU from the available space. There is no upper limit to the data size of the table as long as there is adequate space in the system. There is a zone map for each SPU extent in which it stores min and max value of every data column stored in that SPU extent.
With this default storage allocation of 3MB extents and data slices, a typical Netezza appliance can support up to 66,000 user tables.
Figure 2: Netezza disk arrangement on each SPU
Netezza Zone Maps
The Netezza zone maps are refreshed every time we perform any of the following activities:
- Generate Statistics
- Insert and update
Inside Netezza Appliance
While we only see the data we store in the Netezza appliance tables, the actual addition, deletion, and modification involves several hidden columns. Although, these columns are hidden and work in the background, we can use them to know how the data is arranged inside Netezza appliance. Here are some of the hidden columns and their role in Netezza appliance.
- Datasliceid –dataslice on which this row resides
- _extentid – The extent on which the row resides. _extentid is not unique across the machine, only within each dataslice.
- _pageid – The ID of the page on which the row resides. _pageid is not unique across the machine, only within each dataslice.
- rowid – A unique identifier for the row. If a row is deleted and re-inserted it will get a new rowid.
- Createxid – The transaction id for the created row
- Deletexid – The transaction id for the deleted row
Let us understand the uses of the columns through an example.
SYSTEM.ADMIN(ADMIN)=> delete from EMP where EMP_NAME=’NANDI1′ ;
SYSTEM.ADMIN(ADMIN)=> UPDATE EMP SET EMP_NO=1003 WHERE EMP_NAME=’NANDI’;
SYSTEM.ADMIN(ADMIN)=> Select Datasliceid, _extentid, _pageid, rowid, createxid, deletexid , * from EMP ;
Deleted records are not visible unless we set the show_deleted_records to true at session level.
SYSTEM.ADMIN(ADMIN)=> set show_deleted_records=true ;
SYSTEM.ADMIN(ADMIN)=> Select createxid, deletexid, * from EMP ;
The last command returns all the details for each row ever stored on the table. The reason for this is that any row that is deleted in Netezza is only deleted logically and not physically from disk storage. The transaction id is stored in ‘deletexid’ column of that row and we can check those rows as and when required. Not only this, we can also retrieve those records unless we purge them using groom.
SYSTEM.ADMIN(ADMIN)=> set show_deleted_records=true ;
SYSTEM.ADMIN(ADMIN)=> Select createxid, deletexid, * from EMP where deletexid !=0;
This feature of retrieving deleted data is an added bonus as it lets you check if the right data is deleted. You can also re-insert the data into the main table from the retrieval table. The process for which is – copying the data into the Netezza temp tables, validating the rows, and re-inserting them into the final table.(2 rows)
Data updating process in Netezza is delete + insert; i.e. the row is logically deleted and the data is again added into the table as a new row. The createxid and deletexid of the row, however, remains unchanged.
The deleted data though, continues to occupy space on the hard disk. If we wish to permanently delete all the unwanted records and free the disk space, we need to purge all the records using groom. Grooming the tables periodically helps release the preoccupied space by deleted data items, updating the statistics and zone map for better performance of the Netezza system.
Groom and its Impact
Not checking and running groom in the Netezza appliance can result in space complications. Let us understand through a case study how not running groom properly in Netezza system worsens the situation.
DBA team is facing Netezza Disk space full event alerts in production (More than 90% space occupied alert).
NPS NZXYZ-H1: one occurrences of Hardware Disk Full from 06-Mar-18 07:48:08 MST over two minutes.
For which DBA and ETL team need to search for unwanted and duplicate tables, which are not in use or were stored earlier for SCR backup, or were backed up to resolve other data issues. They will drop these tables to release some spaces in the system for general operations like loading etc.
Figure 3: Netezza used and free space in a Netezza system with fragmented tables
In-depth analysis of storage and groom running on system
DBA team studied Netezza space location in the data file in detail and studied extent level, page level and record level allocation of space as per table space requirements. They found that even if a single record exists in the extent it occupies all that extent space and does not releases that space even after grooming the table at extent level. However, when they groom the table on record basis, it resulted in complete rewrite of table and shrinking of records to release unwanted space. Empty pages (128KB) were also removed from the scan list (although they still exist in the table) along with empty extents (3MB).
Note 1: “empty” here means that the page/extent only contains deleted rows, and those rows are currently reclaimable — i.e. not of interest.
Resolution for releasing space and for optimum performance
Earlier the system were using below command as per recommendation of Netezza support, releasing only free pages.
#nohup nz_groom DB_NAME -pages ALL -backupset none
Now, after implementation of the below command for Netezza table grooming as per the requirement frees up desirable space for the loads and other operations.
#nohup nz_groom DB_NAME -records ALL -backupset none
Result after implementation of resolution applied to system
The DBA team changed Netezza table grooming policy from page level to record level and achieved the desired result.
Figure 4: Netezza used and free space in a Netezza system after proper grooming of fragmented tables
This helped them reclaim up to 50 percent disk space and enhanced system performance as well.
Note 2: two things are important to note here; first, the deleted data is not retrievable unless it is backed up properly. Secondly, grooming all the records is a time consuming process, as it rewrites all table data again, so it must be run when system has no busy schedule such as weekends.
STATISTICS and GROOM are two premier elements for optimum performance in NETEZZA. It is, therefore, crucial to focus on both these aspects and automated them periodically to ensure NETEZZA appliance offers peak performance. In addition, it is difficult to GROOM tables record-wise every week, as it requires rewriting each row, making it a resource intensive and time-consuming initiative. However, we can schedule GROOM page wise weekly and record wise quarterly.
That is all from us on Netezza Groom, until next time.
Let us know your thoughts on the subject in the comments below.
- Column Store Index in SQL Server 2012
This post is about the new feature, i.e., Column Store Index which is available since SQL 2012 version. Microsoft has released column store index to improve the performance by 10x.…
- Teradata Intelligent Memory (TIM)
Overview of Teradata Teradata is a RDBMS (Relational Database Management System). This system is solely based on off-the-shelf (readymade) symmetric multiprocessing (SMP) technology combined with communication networking, connecting SMP systems…
- Content Data Store
Content Data Store Content Data Store (CDS) is a system to provide storage facilities to massive data sets in the form of images, pdfs, documents and scanned documents. This dataset…
- An Introduction to Azure Data Lake
After working on Data Warehouses and understanding its utility and impact on business, we determine that they are like systematic storage. How awesome will it be to have a system…
- Unclogging the Mind: Data Science vs. Big Data vs. Data Analytics
We are living in a data-driven world, where most of our acts are either an action or a reaction to the information we receive. According to reports, we are creating…
- Oracle Goldengate
Oracle GoldenGate is an Oracle proprietary software for real-time data integration and replication that supports different databases- Oracle, DB2, SQL Server, Ingres, MySQL etc. Even the source and target database…