(Extract from the Informatica Performance Tuning Guide, Chapter 6: Optimizing Transformations)
Optimizing Transformations

This chapter includes the following topics:
• Optimizing Aggregator Transformations
• Optimizing Custom Transformations
• Optimizing Joiner Transformations
• Optimizing Lookup Transformations
• Optimizing Sequence Generator Transformations
• Optimizing Sorter Transformations
• Optimizing Source Qualifier Transformations
• Optimizing SQL Transformations
Optimizing Aggregator Transformations

Aggregator transformations often slow performance because they must group data before processing it. Aggregator transformations need additional memory to hold intermediate group results.

Use the following guidelines to optimize the performance of an Aggregator transformation:
• Group by simple columns.
• Use sorted input.
• Use incremental aggregation.
• Filter data before you aggregate it.
• Limit port connections.
Grouping By Simple Columns

You can optimize Aggregator transformations when you group by simple columns. When possible, use numbers instead of strings and dates in the GROUP BY columns. Avoid complex expressions in the Aggregator expressions.
Using Sorted Input

To increase session performance, sort data for the Aggregator transformation. Use the Sorted Input option to sort data.

The Sorted Input option decreases the use of aggregate caches. When you use the Sorted Input option, the Integration Service assumes all data is sorted by group. As the Integration Service reads rows for a group, it performs aggregate calculations. When necessary, it stores group information in memory.

The Sorted Input option reduces the amount of data cached during the session and improves performance. Use this option with the Source Qualifier Number of Sorted Ports option or a Sorter transformation to pass sorted data to the Aggregator transformation.

You can increase performance when you use the Sorted Input option in sessions with multiple partitions.
Using Incremental Aggregation

If you can capture changes from the source that affect less than half the target, you can use incremental aggregation to optimize the performance of Aggregator transformations.

When you use incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The Integration Service updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.

You can increase the index and data cache sizes to hold all data in memory without paging to disk.
Filtering Data Before You Aggregate

Filter the data before you aggregate it. If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.
Limiting Port Connections

Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.
Optimizing Custom Transformations

The Integration Service can pass a single row to a Custom transformation procedure or a block of rows in an array. You can write the procedure code to specify whether the procedure receives one row or a block of rows.

You can increase performance when the procedure receives a block of rows:

• You can decrease the number of function calls the Integration Service and procedure make. The Integration Service calls the input row notification function fewer times, and the procedure calls the output notification function fewer times.
• You can increase the locality of memory access space for the data.
• You can write the procedure code to perform an algorithm on a block of data instead of each row of data.
Optimizing Joiner Transformations

Joiner transformations can slow performance because they need additional space at run time to hold intermediary results. You can view Joiner performance counter information to determine whether you need to optimize the Joiner transformations.

Use the following tips to improve session performance with the Joiner transformation:

• Designate the master source as the source with fewer duplicate key values. When the Integration Service processes a sorted Joiner transformation, it caches rows for one hundred unique keys at a time. If the master source contains many rows with the same key value, the Integration Service must cache more rows, and performance can be slowed.
• Designate the master source as the source with fewer rows. During a session, the Joiner transformation compares each row of the detail source against the master source. The fewer rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
• Perform joins in a database when possible. Performing a join in a database is faster than performing a join in the session. The type of database join you use can affect performance. Normal joins are faster than outer joins and result in fewer rows. In some cases, you cannot perform the join in the database, such as joining tables from two different databases or flat file systems.
  To perform a join in a database, use the following options:
  - Create a pre-session stored procedure to join the tables in a database.
  - Use the Source Qualifier transformation to perform the join.
• Join sorted data when possible. To improve session performance, configure the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the Integration Service improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets. For an unsorted Joiner transformation, designate the source with fewer rows as the master source.
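As a sketch of the Source Qualifier approach, a SQL override can push the join into the source database so the session receives pre-joined rows. The table and column names below are hypothetical, not from this guide:

```sql
SELECT ORDERS.ORDER_ID, ORDERS.CUSTOMER_ID, CUSTOMERS.CUSTOMER_NAME
FROM ORDERS
JOIN CUSTOMERS
  ON ORDERS.CUSTOMER_ID = CUSTOMERS.CUSTOMER_ID
```

Because the database performs the join, the mapping no longer needs a Joiner transformation or its run-time caches for these two tables.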
Optimizing Lookup Transformations

If the lookup table is on the same database as the source table in your mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.

If you use a Lookup transformation, perform the following tasks to increase performance:

• Use the optimal database driver.
• Cache lookup tables.
• Optimize the lookup condition.
• Filter lookup rows.
• Index the lookup table.
• Optimize multiple lookups.
• Create a pipeline Lookup transformation and configure partitions in the pipeline that builds the lookup source.
Using Optimal Database Drivers

The Integration Service can connect to a lookup table using a native database driver or an ODBC driver. Native database drivers provide better session performance than ODBC drivers.
Caching Lookup Tables

If a mapping contains Lookup transformations, you might want to enable lookup caching. When you enable caching, the Integration Service caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the Integration Service queries the lookup table on a row-by-row basis.

The result of the Lookup query and processing is the same, whether or not you cache the lookup table. However, using a lookup cache can increase session performance for smaller lookup tables. In general, you want to cache lookup tables that need less than 300 MB.
Complete the following tasks to further enhance performance for Lookup transformations:

• Use the appropriate cache type.
• Enable concurrent caches.
• Optimize Lookup condition matching.
• Reduce the number of cached rows.
• Override the ORDER BY statement.
• Use a machine with more memory.
Types of Caches

Use the following types of caches to increase performance:

• Shared cache. You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping. You can share a named cache between transformations in the same or different mappings.
• Persistent cache. To save and reuse the cache files, you can configure the transformation to use a persistent cache. Use this feature when you know the lookup table does not change between session runs. Using a persistent cache can improve performance because the Integration Service builds the memory cache from the cache files instead of from the database.
Enabling Concurrent Caches

When the Integration Service processes sessions that contain Lookup transformations, the Integration Service builds a cache in memory when it processes the first row of data in a cached Lookup transformation. If there are multiple Lookup transformations in a mapping, the Integration Service creates the caches sequentially when the first row of data is processed by the Lookup transformation. This slows Lookup transformation processing.

You can enable concurrent caches to improve performance. When the number of additional concurrent pipelines is set to one or more, the Integration Service builds caches concurrently rather than sequentially. Performance improves greatly when the sessions contain a number of active transformations that may take time to complete, such as Aggregator, Joiner, or Sorter transformations. When you enable multiple concurrent pipelines, the Integration Service no longer waits for active sessions to complete before it builds the cache. Other Lookup transformations in the pipeline also build caches concurrently.
Optimizing Lookup Condition Matching

When the Lookup transformation matches lookup cache data with the lookup condition, it sorts and orders the data to determine the first matching value and the last matching value. You can configure the transformation to return any value that matches the lookup condition. When you configure the Lookup transformation to return any matching value, the transformation returns the first value that matches the lookup condition. It does not index all ports as it does when you configure the transformation to return the first matching value or the last matching value. When you use any matching value, performance can improve because the transformation skips the indexing of all ports, an operation that can slow performance.
Reducing the Number of Cached Rows

You can reduce the number of rows included in the cache to increase performance. Use the Lookup SQL Override option to add a WHERE clause to the default SQL statement.
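As a sketch, the following Lookup SQL override restricts the cache to a single category of rows. The CATEGORY column and its value are hypothetical; only the ITEMS_DIM ports come from the example used elsewhere in this chapter:

```sql
SELECT ITEMS_DIM.ITEM_NAME, ITEMS_DIM.PRICE, ITEMS_DIM.ITEM_ID
FROM ITEMS_DIM
WHERE ITEMS_DIM.CATEGORY = 'HARDWARE'
```

Fewer cached rows mean a smaller lookup cache to build and search during the session.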
Overriding the ORDER BY Statement

By default, the Integration Service generates an ORDER BY statement for a cached lookup. The ORDER BY statement contains all lookup ports. To increase performance, suppress the default ORDER BY statement and enter an override ORDER BY with fewer columns.

The Integration Service always generates an ORDER BY statement, even if you enter one in the override. Place two dashes '--' after the ORDER BY override to suppress the generated ORDER BY statement.

For example, a Lookup transformation uses the following lookup condition:

ITEM_ID = IN_ITEM_ID
PRICE <= IN_PRICE

The Lookup transformation includes three lookup ports used in the mapping: ITEM_ID, ITEM_NAME, and PRICE. When you enter the ORDER BY statement, enter the columns in the same order as the ports in the lookup condition. You must also enclose all database reserved words in quotes.

Enter the following lookup query in the lookup SQL override:

SELECT ITEMS_DIM.ITEM_NAME, ITEMS_DIM.PRICE, ITEMS_DIM.ITEM_ID
FROM ITEMS_DIM
ORDER BY ITEMS_DIM.ITEM_ID, ITEMS_DIM.PRICE --
Using a Machine with More Memory

To increase session performance, run the session on an Integration Service node with a large amount of memory. Increase the index and data cache sizes as high as you can without straining the machine. If the Integration Service node has enough memory, increase the cache so it can hold all data in memory without paging to disk.
Optimizing the Lookup Condition

If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:

• Equal to (=)
• Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=)
• Not equal to (!=)
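For example, a lookup with one equality comparison and one range comparison follows this ordering by placing the equality condition first (the port names are illustrative):

```sql
ITEM_ID = IN_ITEM_ID
PRICE <= IN_PRICE
```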
Filtering Lookup Rows

Create a filter condition to reduce the number of lookup rows retrieved from the source when the lookup cache is built.
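As a sketch, a lookup source filter condition such as the following keeps discontinued rows out of the cache. The DISCONTINUED_FLAG column is hypothetical; only the ITEMS_DIM table comes from the example used elsewhere in this chapter:

```sql
ITEMS_DIM.DISCONTINUED_FLAG = 0
```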
Indexing the Lookup Table

The Integration Service needs to query, sort, and compare values in the lookup condition columns. The index needs to include every column used in a lookup condition.
You can improve performance for the following types of lookups:

• Cached lookups. To improve performance, index the columns in the lookup ORDER BY statement. The session log contains the ORDER BY statement.
• Uncached lookups. To improve performance, index the columns in the lookup condition. The Integration Service issues a SELECT statement for each row that passes into the Lookup transformation.
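As a sketch, a database index that covers the lookup condition columns from the ITEMS_DIM example in this chapter might be created as follows (the index name is hypothetical, and the exact DDL depends on your database):

```sql
CREATE INDEX IDX_ITEMS_DIM_LOOKUP
  ON ITEMS_DIM (ITEM_ID, PRICE);
```

With this index in place, the database can resolve both cached-lookup ORDER BY processing and uncached per-row SELECT statements without scanning the full table.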
Optimizing Multiple Lookups

If a mapping contains multiple lookups, even with caching enabled and enough heap memory, the lookups can slow performance. Tune the Lookup transformations that query the largest amounts of data to improve overall performance.

To determine which Lookup transformations process the most data, examine the Lookup_rowsinlookupcache counters for each Lookup transformation. The Lookup transformations that have a large number in this counter might benefit from tuning their lookup expressions. If those expressions can be optimized, session performance improves.
Creating a Pipeline Lookup Transformation

A mapping that contains a pipeline Lookup transformation includes a partial pipeline that contains the lookup source and a source qualifier. The Integration Service processes the lookup source data in this pipeline. It passes the lookup source data to the pipeline that contains the Lookup transformation and it creates the cache.

The partial pipeline is a separate target load order group in session properties. You can configure multiple partitions in this pipeline to improve performance.
Optimizing Sequence Generator Transformations

To optimize Sequence Generator transformations, create a reusable Sequence Generator and use it in multiple mappings simultaneously. Also, configure the Number of Cached Values property.

The Number of Cached Values property determines the number of values the Integration Service caches at one time. Make sure that the Number of Cached Values is not too small. Consider configuring the Number of Cached Values to a value greater than 1,000.

If you do not have to cache values, set the Number of Cached Values to 0. Sequence Generator transformations that do not use cache are faster than those that require cache.
Optimizing Sorter Transformations

Complete the following tasks to optimize a Sorter transformation:

• Allocate enough memory to sort the data.
• Specify a different work directory for each partition in the Sorter transformation.
Allocating Memory

For optimal performance, configure the Sorter cache size with a value less than or equal to the amount of available physical RAM on the Integration Service node. Allocate at least 16 MB of physical memory to sort data using the Sorter transformation. The Sorter cache size is set to 16,777,216 bytes by default. If the Integration Service cannot allocate enough memory to sort data, it fails the session.

If the amount of incoming data is greater than the Sorter cache size, the Integration Service temporarily stores data in the Sorter transformation work directory. The Integration Service requires disk space of at least twice the amount of incoming data when storing data in the work directory. If the amount of incoming data is significantly greater than the Sorter cache size, the Integration Service may require much more than twice the amount of disk space available to the work directory.

Use the following formula to determine the size of incoming data:

# input rows × ([Sum(column size)] + 16)

For example, 1,000,000 input rows with a total column size of 64 bytes require 1,000,000 × (64 + 16) = 80,000,000 bytes, or roughly 76 MB.
Work Directories for Partitions

The Integration Service creates temporary files when it sorts data. It stores them in a work directory. You can specify any directory on the Integration Service node to use as a work directory. By default, the Integration Service uses the value specified for the $PMTempDir service process variable.

When you partition a session with a Sorter transformation, you can specify a different work directory for each partition in the pipeline. To increase session performance, specify work directories on physically separate disks on the Integration Service nodes.
Optimizing Source Qualifier Transformations

Use the Select Distinct option for the Source Qualifier transformation if you want the Integration Service to select unique values from a source. Use the Select Distinct option to filter unnecessary data earlier in the data flow. This can improve performance.
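When Select Distinct is enabled, the Source Qualifier adds a DISTINCT clause to the query it issues against the source, so duplicate rows never enter the pipeline. The effect is equivalent to a query such as the following (table and columns are illustrative):

```sql
SELECT DISTINCT CUSTOMER_ID, CUSTOMER_NAME
FROM CUSTOMERS
```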
Optimizing SQL Transformations

When you create an SQL transformation, you configure the transformation to use external SQL queries or queries that you define in the transformation. When you configure an SQL transformation to run in script mode, the Integration Service processes an external SQL script for each input row. When the transformation runs in query mode, the Integration Service processes an SQL query that you define in the transformation.

Each time the Integration Service processes a new query in a session, it calls a function called SQLPrepare to create an SQL procedure and pass it to the database. When the query changes for each input row, it has a performance impact.
When the transformation runs in query mode, construct a static query in the transformation to improve performance. A static query statement does not change, although the data in the query clause changes. To create a static query, use parameter binding instead of string substitution in the SQL Editor. When you use parameter binding, you set parameters in the query clause to values in the transformation input ports.
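As a sketch, a static query that uses parameter binding might look like the following, where ?ITEM_ID? binds the query to a transformation input port. The table and port names are illustrative, and the ?port? notation assumes the SQL Editor's parameter-binding syntax:

```sql
SELECT ITEM_NAME, PRICE
FROM ITEMS_DIM
WHERE ITEM_ID = ?ITEM_ID?
```

Because only the bound value changes from row to row, the query text stays static, and the Integration Service avoids repeated SQLPrepare calls.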
When an SQL query contains commit and rollback query statements, the Integration Service must recreate the SQL procedure after each commit or rollback. To optimize performance, do not use transaction statements in an SQL transformation query.
When you create the SQL transformation, you configure how the transformation connects to the database. You can choose a static connection or you can pass connection information to the transformation at run time.

When you configure the transformation to use a static connection, you choose a connection from the Workflow Manager connections. The SQL transformation connects to the database once during the session. When you pass dynamic connection information, the SQL transformation connects to the database each time the transformation processes an input row.