(Extract from the Informatica Performance Tuning Guide, Chapter 6: Optimizing Transformations)
Optimizing Transformations

This chapter includes the following topics:
• Optimizing Aggregator Transformations
• Optimizing Custom Transformations
• Optimizing Joiner Transformations
• Optimizing Lookup Transformations
• Optimizing Sequence Generator Transformations
• Optimizing Sorter Transformations
• Optimizing Source Qualifier Transformations
• Optimizing SQL Transformations
Optimizing Aggregator Transformations

Aggregator transformations often slow performance because they must group data before processing it. Aggregator transformations need additional memory to hold intermediate group results.

Use the following guidelines to optimize the performance of an Aggregator transformation:
• Group by simple columns.
• Use sorted input.
• Use incremental aggregation.
• Filter data before you aggregate it.
• Limit port connections.
Grouping By Simple Columns

You can optimize Aggregator transformations when you group by simple columns. When possible, use numbers instead of strings and dates in the GROUP BY columns. Avoid complex expressions in the Aggregator expressions.
Using Sorted Input

To increase session performance, sort data for the Aggregator transformation. Use the Sorted Input option to sort data.

The Sorted Input option decreases the use of aggregate caches. When you use the Sorted Input option, the Integration Service assumes all data is sorted by group. As the Integration Service reads rows for a group, it performs aggregate calculations. When necessary, it stores group information in memory.

The Sorted Input option reduces the amount of data cached during the session and improves performance. Use this option with the Source Qualifier Number of Sorted Ports option or a Sorter transformation to pass sorted data to the Aggregator transformation.

You can increase performance when you use the Sorted Input option in sessions with multiple partitions.
Using Incremental Aggregation

If you can capture changes from the source that affect less than half the target, you can use incremental aggregation to optimize the performance of Aggregator transformations.

When you use incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The Integration Service updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.

You can increase the index and data cache sizes to hold all data in memory without paging to disk.
Filtering Data Before You Aggregate

Filter the data before you aggregate it. If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.
Limiting Port Connections

Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.
Optimizing Custom Transformations

The Integration Service can pass a single row to a Custom transformation procedure or a block of rows in an array. You can write the procedure code to specify whether the procedure receives one row or a block of rows.

You can increase performance when the procedure receives a block of rows:

• You can decrease the number of function calls the Integration Service and procedure make. The Integration Service calls the input row notification function fewer times, and the procedure calls the output notification function fewer times.
• You can increase the locality of memory access space for the data.
• You can write the procedure code to perform an algorithm on a block of data instead of each row of data.
Optimizing Joiner Transformations

Joiner transformations can slow performance because they need additional space at run time to hold intermediary results. You can view Joiner performance counter information to determine whether you need to optimize the Joiner transformations.

Use the following tips to improve session performance with the Joiner transformation:

• Designate the master source as the source with fewer duplicate key values. When the Integration Service processes a sorted Joiner transformation, it caches rows for one hundred unique keys at a time. If the master source contains many rows with the same key value, the Integration Service must cache more rows, and performance can be slowed.
• Designate the master source as the source with fewer rows. During a session, the Joiner transformation compares each row of the detail source against the master source. The fewer rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
• Perform joins in a database when possible. Performing a join in a database is faster than performing a join in the session. The type of database join you use can affect performance. Normal joins are faster than outer joins and result in fewer rows. In some cases, you cannot perform the join in the database, such as joining tables from two different databases or flat file systems.
  To perform a join in a database, use the following options:
  - Create a pre-session stored procedure to join the tables in a database.
  - Use the Source Qualifier transformation to perform the join.
• Join sorted data when possible. To improve session performance, configure the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the Integration Service improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets. For an unsorted Joiner transformation, designate the source with fewer rows as the master source.
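As a sketch of the Source Qualifier approach, a SQL override can push the join into the source database so the session receives pre-joined rows. The table and column names below are hypothetical, not from this guide:

```sql
SELECT ORDERS.ORDER_ID, ORDERS.CUSTOMER_ID, CUSTOMERS.CUSTOMER_NAME
FROM ORDERS
JOIN CUSTOMERS
  ON ORDERS.CUSTOMER_ID = CUSTOMERS.CUSTOMER_ID
```

Because the database performs the join, the mapping no longer needs a Joiner transformation or its run-time caches for these two tables.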
Optimizing Lookup Transformations

If the lookup table is on the same database as the source table in your mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.

If you use a Lookup transformation, perform the following tasks to increase performance:

• Use the optimal database driver.
• Cache lookup tables.
• Optimize the lookup condition.
• Filter lookup rows.
• Index the lookup table.
• Optimize multiple lookups.
• Create a pipeline Lookup transformation and configure partitions in the pipeline that builds the lookup source.
Using Optimal Database Drivers

The Integration Service can connect to a lookup table using a native database driver or an ODBC driver. Native database drivers provide better session performance than ODBC drivers.
Caching Lookup Tables

If a mapping contains Lookup transformations, you might want to enable lookup caching. When you enable caching, the Integration Service caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the Integration Service queries the lookup table on a row-by-row basis.

The result of the Lookup query and processing is the same, whether or not you cache the lookup table. However, using a lookup cache can increase session performance for smaller lookup tables. In general, you want to cache lookup tables that need less than 300 MB.
Complete the following tasks to further enhance performance for Lookup transformations:

• Use the appropriate cache type.
• Enable concurrent caches.
• Optimize Lookup condition matching.
• Reduce the number of cached rows.
• Override the ORDER BY statement.
• Use a machine with more memory.
Types of Caches

Use the following types of caches to increase performance:

• Shared cache. You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping. You can share a named cache between transformations in the same or different mappings.
• Persistent cache. To save and reuse the cache files, you can configure the transformation to use a persistent cache. Use this feature when you know the lookup table does not change between session runs. Using a persistent cache can improve performance because the Integration Service builds the memory cache from the cache files instead of from the database.
Enabling Concurrent Caches

When the Integration Service processes sessions that contain Lookup transformations, the Integration Service builds a cache in memory when it processes the first row of data in a cached Lookup transformation. If there are multiple Lookup transformations in a mapping, the Integration Service creates the caches sequentially when the first row of data is processed by the Lookup transformation. This slows Lookup transformation processing.

You can enable concurrent caches to improve performance. When the number of additional concurrent pipelines is set to one or more, the Integration Service builds caches concurrently rather than sequentially. Performance improves greatly when the sessions contain a number of active transformations that may take time to complete, such as Aggregator, Joiner, or Sorter transformations. When you enable multiple concurrent pipelines, the Integration Service no longer waits for active sessions to complete before it builds the cache. Other Lookup transformations in the pipeline also build caches concurrently.
Optimizing Lookup Condition Matching

When the Lookup transformation matches lookup cache data with the lookup condition, it sorts and orders the data to determine the first matching value and the last matching value. You can configure the transformation to return any value that matches the lookup condition. When you configure the Lookup transformation to return any matching value, the transformation returns the first value that matches the lookup condition. It does not index all ports as it does when you configure the transformation to return the first matching value or the last matching value. When you use any matching value, performance can improve because the transformation skips the indexing of all ports, an operation that can slow performance.
Reducing the Number of Cached Rows

You can reduce the number of rows included in the cache to increase performance. Use the Lookup SQL Override option to add a WHERE clause to the default SQL statement.
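As a sketch, the following Lookup SQL override restricts the cache to a single category of rows. The CATEGORY column and its value are hypothetical; only the ITEMS_DIM ports come from the example used elsewhere in this chapter:

```sql
SELECT ITEMS_DIM.ITEM_NAME, ITEMS_DIM.PRICE, ITEMS_DIM.ITEM_ID
FROM ITEMS_DIM
WHERE ITEMS_DIM.CATEGORY = 'HARDWARE'
```

Fewer cached rows mean a smaller lookup cache to build and search during the session.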
Overriding the ORDER BY Statement

By default, the Integration Service generates an ORDER BY statement for a cached lookup. The ORDER BY statement contains all lookup ports. To increase performance, suppress the default ORDER BY statement and enter an override ORDER BY with fewer columns.

The Integration Service always generates an ORDER BY statement, even if you enter one in the override. Place two dashes '--' after the ORDER BY override to suppress the generated ORDER BY statement.

For example, a Lookup transformation uses the following lookup condition:

ITEM_ID = IN_ITEM_ID
PRICE <= IN_PRICE

The Lookup transformation includes three lookup ports used in the mapping: ITEM_ID, ITEM_NAME, and PRICE. When you enter the ORDER BY statement, enter the columns in the same order as the ports in the lookup condition. You must also enclose all database reserved words in quotes.

Enter the following lookup query in the lookup SQL override:

SELECT ITEMS_DIM.ITEM_NAME, ITEMS_DIM.PRICE, ITEMS_DIM.ITEM_ID
FROM ITEMS_DIM
ORDER BY ITEMS_DIM.ITEM_ID, ITEMS_DIM.PRICE --
Using a Machine with More Memory

To increase session performance, run the session on an Integration Service node with a large amount of memory. Increase the index and data cache sizes as high as you can without straining the machine. If the Integration Service node has enough memory, increase the cache so it can hold all data in memory without paging to disk.
Optimizing the Lookup Condition

If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:

• Equal to (=)
• Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=)
• Not equal to (!=)
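For example, a lookup with one equality comparison and one range comparison follows this ordering by placing the equality condition first (the port names are illustrative):

```sql
ITEM_ID = IN_ITEM_ID
PRICE <= IN_PRICE
```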
Filtering Lookup Rows

Create a filter condition to reduce the number of lookup rows retrieved from the source when the lookup cache is built.
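As a sketch, a lookup source filter condition such as the following keeps discontinued rows out of the cache. The DISCONTINUED_FLAG column is hypothetical; only the ITEMS_DIM table comes from the example used elsewhere in this chapter:

```sql
ITEMS_DIM.DISCONTINUED_FLAG = 0
```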
Indexing the Lookup Table

The Integration Service needs to query, sort, and compare values in the lookup condition columns. The index needs to include every column used in a lookup condition.
You can improve performance for the following types of lookups:

• Cached lookups. To improve performance, index the columns in the lookup ORDER BY statement. The session log contains the ORDER BY statement.
• Uncached lookups. To improve performance, index the columns in the lookup condition. The Integration Service issues a SELECT statement for each row that passes into the Lookup transformation.
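As a sketch, a database index that covers the lookup condition columns from the ITEMS_DIM example in this chapter might be created as follows (the index name is hypothetical, and the exact DDL depends on your database):

```sql
CREATE INDEX IDX_ITEMS_DIM_LOOKUP
  ON ITEMS_DIM (ITEM_ID, PRICE);
```

With this index in place, the database can resolve both cached-lookup ORDER BY processing and uncached per-row SELECT statements without scanning the full table.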
Optimizing Multiple Lookups

If a mapping contains multiple lookups, even with caching enabled and enough heap memory, the lookups can slow performance. Tune the Lookup transformations that query the largest amounts of data to improve overall performance.

To determine which Lookup transformations process the most data, examine the Lookup_rowsinlookupcache counters for each Lookup transformation. The Lookup transformations that have a large number in this counter might benefit from tuning their lookup expressions. If those expressions can be optimized, session performance improves.
Creating a Pipeline Lookup Transformation

A mapping that contains a pipeline Lookup transformation includes a partial pipeline that contains the lookup source and a source qualifier. The Integration Service processes the lookup source data in this pipeline. It passes the lookup source data to the pipeline that contains the Lookup transformation and it creates the cache.

The partial pipeline is a separate target load order group in session properties. You can configure multiple partitions in this pipeline to improve performance.
Optimizing Sequence Generator Transformations

To optimize Sequence Generator transformations, create a reusable Sequence Generator and use it in multiple mappings simultaneously. Also, configure the Number of Cached Values property.

The Number of Cached Values property determines the number of values the Integration Service caches at one time. Make sure that the Number of Cached Values is not too small. Consider configuring the Number of Cached Values to a value greater than 1,000.

If you do not have to cache values, set the Number of Cached Values to 0. Sequence Generator transformations that do not use cache are faster than those that require cache.
Optimizing Sorter Transformations

Complete the following tasks to optimize a Sorter transformation:

• Allocate enough memory to sort the data.
• Specify a different work directory for each partition in the Sorter transformation.
Allocating Memory

For optimal performance, configure the Sorter cache size with a value less than or equal to the amount of available physical RAM on the Integration Service node. Allocate at least 16 MB of physical memory to sort data using the Sorter transformation. The Sorter cache size is set to 16,777,216 bytes by default. If the Integration Service cannot allocate enough memory to sort data, it fails the session.

If the amount of incoming data is greater than the Sorter cache size, the Integration Service temporarily stores data in the Sorter transformation work directory. The Integration Service requires disk space of at least twice the amount of incoming data when storing data in the work directory. If the amount of incoming data is significantly greater than the Sorter cache size, the Integration Service may require much more than twice the amount of disk space available to the work directory.

Use the following formula to determine the size of incoming data:

# input rows × ([Sum(column size)] + 16)

For example, 1,000,000 input rows with a total column size of 64 bytes require 1,000,000 × (64 + 16) = 80,000,000 bytes, or roughly 76 MB.
Work Directories for Partitions

The Integration Service creates temporary files when it sorts data. It stores them in a work directory. You can specify any directory on the Integration Service node to use as a work directory. By default, the Integration Service uses the value specified for the $PMTempDir service process variable.

When you partition a session with a Sorter transformation, you can specify a different work directory for each partition in the pipeline. To increase session performance, specify work directories on physically separate disks on the Integration Service nodes.
Optimizing Source Qualifier Transformations

Use the Select Distinct option for the Source Qualifier transformation if you want the Integration Service to select unique values from a source. Use the Select Distinct option to filter unnecessary data earlier in the data flow. This can improve performance.
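When Select Distinct is enabled, the Source Qualifier adds a DISTINCT clause to the query it issues against the source, so duplicate rows never enter the pipeline. The effect is equivalent to a query such as the following (table and columns are illustrative):

```sql
SELECT DISTINCT CUSTOMER_ID, CUSTOMER_NAME
FROM CUSTOMERS
```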
Optimizing SQL Transformations

When you create an SQL transformation, you configure the transformation to use external SQL queries or queries that you define in the transformation. When you configure an SQL transformation to run in script mode, the Integration Service processes an external SQL script for each input row. When the transformation runs in query mode, the Integration Service processes an SQL query that you define in the transformation.

Each time the Integration Service processes a new query in a session, it calls a function called SQLPrepare to create an SQL procedure and pass it to the database. When the query changes for each input row, it has a performance impact.
When the transformation runs in query mode, construct a static query in the transformation to improve performance. A static query statement does not change, although the data in the query clause changes. To create a static query, use parameter binding instead of string substitution in the SQL Editor. When you use parameter binding, you set parameters in the query clause to values in the transformation input ports.
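As a sketch, a static query that uses parameter binding might look like the following, where ?ITEM_ID? binds the query to a transformation input port. The table and port names are illustrative, and the ?port? notation assumes the SQL Editor's parameter-binding syntax:

```sql
SELECT ITEM_NAME, PRICE
FROM ITEMS_DIM
WHERE ITEM_ID = ?ITEM_ID?
```

Because only the bound value changes from row to row, the query text stays static, and the Integration Service avoids repeated SQLPrepare calls.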
When an SQL query contains commit and rollback query statements, the Integration Service must recreate the SQL procedure after each commit or rollback. To optimize performance, do not use transaction statements in an SQL transformation query.
When you create the SQL transformation, you configure how the transformation connects to the database. You can choose a static connection or you can pass connection information to the transformation at run time.

When you configure the transformation to use a static connection, you choose a connection from the Workflow Manager connections. The SQL transformation connects to the database once during the session. When you pass dynamic connection information, the SQL transformation connects to the database each time the transformation processes an input row.