what’s new in Oracle 11g R2 – external tables preprocessor


I’ve been a big fan of Oracle external tables for a very long time, and I’m very glad to see new features coming in 11g R2.

Oracle introduced the preprocessor in 11.2 – basically any unix/windows script (as of January 2010 there is no 11.2 on the Windows platform yet) which executes automatically before the external table is read. The output of the script is the source of the external table (via stdout on unix). The classic example is processing compressed files – Oracle executes a unix script which runs uncompress, and the output of that script becomes the source of the external table.

Now you might say this is nothing exciting – you probably already run uncompress in a unix script for all incoming files you load using external tables. Well, there is one advantage – or better said, three main advantages – to this new preprocessor:

  • it executes automatically as part of the external table read (every time you run SELECT … FROM it)
  • it does NOT have to create any objects in the file system – as I mentioned above, the output of the script is the source of the external table (stdout on unix)
  • the preprocessor can generate the data all by itself – I already mentioned uncompress, but it can be pretty much any script, for example space available on unix via df -k or a list of currently running sessions. If you are a DBA and run unix scripts to “see” things like top sessions or available space, you can run a simple SELECT * FROM my_external_table to get this info without leaving the oracle environment or creating Java stored procedures

To me, all the features above are huge. (1) – just imagine you have an archive folder with all the source files you loaded in the past 90 days, all compressed to save space, and yet you can give selected users access to them via external tables without being forced to uncompress anything. (2) is great as well, simply because there is no need for any additional space or temporary staging files (you have 1GB of free space and it would take 20GB to uncompress the incoming file – no problem). (3) – well, as I mentioned above, any df -k or top or whatever information you need from unix 20x a day, you can now get on the fly via an external table.
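
To illustrate point (3), here is a rough sketch of an external table that simply exposes the output of df -k. The table name, script name and single-column layout are just assumptions for this example; it reuses the oracle directories created in the how-to below, and the dummy location file has to exist even though it can be empty (see the comments at the bottom):

#!/bin/bash
# run_df.sh - stored in the EXECUTE directory; the $1 argument (the dummy
# location file) is ignored, df -k output goes to stdout
/bin/df -k

CREATE TABLE JIRI.UNIX_SPACE_EXT
(
 REPORT_LINE   VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
 DEFAULT DIRECTORY dir_sample_data
 ACCESS PARAMETERS
 ( RECORDS DELIMITED BY NEWLINE
 PREPROCESSOR dir_sample_exec:'run_df.sh'
 NOLOGFILE
 NOBADFILE
 NODISCARDFILE
 FIELDS TERMINATED BY '|'
 MISSING FIELD VALUES ARE NULL
 )
 LOCATION (dir_sample_data:'dummy.dat')
)
REJECT LIMIT UNLIMITED
NOPARALLEL;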

There is, however, one glitch and one huge drawback. The glitch is simple yet important – you need to really understand what you are doing: you are giving users grants to execute unix command(s), so keep in mind that you may be opening a security hole. The drawback is that the preprocessor cannot run in parallel – in other words, if you use PARALLEL on the external table, you need to do some testing and find out whether a separate uncompress + parallel load takes longer than the preprocessor uncompress + load with no parallelism. You can (if the source system in your organization can do it) receive multiple files and load them in parallel (for example 4 incoming compressed files can be uncompressed on the fly using the preprocessor and loaded in parallel) – see the sample at the bottom and the small split sketch right below.
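
If the source system can only send one big file, a rough sketch of pre-splitting it into several smaller compressed pieces on unix could look like this (the file names and the 250000-line chunk size are made up for the example):

# split one large compressed file into smaller compressed pieces
gunzip -c sample_big.dat.gz | split -l 250000 - sample_part_
for f in sample_part_*; do
  mv "$f" "$f.dat"
  gzip "$f.dat"
done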

Here is a small how-to … this was tested on RHEL5 + Oracle Release 11.2.0.1.0 in VMware 3.0.0

1. Create oracle directories which correspond to existing directories on unix. Please notice that one directory gets the EXECUTE grant – this is new in 11.2 and basically specifies where the preprocessor command is stored. I created 4 directories; you can of course create just one, but I would really recommend keeping security in mind and granting write/execute only on the folders that actually need it.

 CREATE DIRECTORY dir_sample_data as '/home/jiri/Desktop/sample/data';
 CREATE DIRECTORY dir_sample_log as '/home/jiri/Desktop/sample/log';
 CREATE DIRECTORY dir_sample_bad as '/home/jiri/Desktop/sample/bad';
 CREATE DIRECTORY dir_sample_exec as '/home/jiri/Desktop/sample/execute';
 GRANT READ        on DIRECTORY dir_sample_data to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_log  to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_bad  to JIRI;
 GRANT EXECUTE     on DIRECTORY dir_sample_exec to JIRI;
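
The folders of course have to exist on the unix side and the oracle software owner (not just the JIRI database user) has to be able to reach them; a minimal sketch, assuming the paths from step 1 – the permissions are only an example, adjust them to your own security standards:

mkdir -p /home/jiri/Desktop/sample/{data,log,bad,execute}
chmod 755 /home/jiri/Desktop/sample/data /home/jiri/Desktop/sample/execute
chmod 775 /home/jiri/Desktop/sample/log  /home/jiri/Desktop/sample/bad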

2. place the sample.dat file (see content below) in the /home/jiri/Desktop/sample/data directory

CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_SSN
 123456, JIRI, 444-555-6666
 234567, TOM, 555-666-5555
 345678, CHRIS, 111-222-3333

3. compress it  (I used gzip in this example)

gzip sample.dat

4. place the unix script below in /home/jiri/Desktop/sample/execute/run_my_unzip.sh (please notice the gunzip output goes to stdout, not to a file – this is important)

#!/bin/bash
# $1 is the file name from the LOCATION clause; gunzip -c writes the uncompressed data to stdout
/bin/gunzip -c $1
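
The script has to be executable by the oracle software owner, otherwise the read of the external table fails:

chmod 755 /home/jiri/Desktop/sample/execute/run_my_unzip.sh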

5. create the external table (this is a classic comma-separated file with a header). Please notice the PREPROCESSOR dir_sample_exec:'run_my_unzip.sh' clause – you have to specify an existing script name, the script has to exist in that directory, and you cannot specify any parameters – for example PREPROCESSOR dir_sample_exec:'gunzip -c' will not work and will return an error. Although I fully understand this limitation (I believe it is there for security reasons and prevents injecting custom code), this could theoretically open the door to very effective decryption – just imagine you receive encrypted files and the passphrase to decrypt them with a private key exists only in oracle (no transient files are created on unix, no SA can decrypt the file, only a person with access to the DDL can do it – and this can be secured even further with oracle vault). I know there are existing ways to do this, but the preprocessor would make it a bit easier.

CREATE TABLE JIRI.SAMPLE_01_EXT
 (
 CUSTOMER_ID     NUMBER(6),
 CUSTOMER_NAME   VARCHAR2(60),
 CUSTOMER_SSN    VARCHAR2(60)
 )
 ORGANIZATION EXTERNAL
 ( TYPE ORACLE_LOADER
 DEFAULT DIRECTORY dir_sample_data
 ACCESS PARAMETERS
 ( RECORDS DELIMITED BY NEWLINE
 PREPROCESSOR dir_sample_exec:'run_my_unzip.sh'
 SKIP 1
 LOGFILE dir_sample_log:'sample.log'
 BADFILE dir_sample_bad:'sample.bad'
 NODISCARDFILE
 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
 MISSING FIELD VALUES ARE NULL
 )
 LOCATION (dir_sample_data:'sample.dat.gz')
 )
 REJECT LIMIT 0
 NOPARALLEL
 NOMONITORING;
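
Once the table is created, the uncompress + load happens on every read – a quick test should return the three customer rows from sample.dat above, with no uncompressed copy of the file ever appearing in the data directory:

SELECT * FROM JIRI.SAMPLE_01_EXT;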

As you can see, the external table preprocessor is very simple yet very effective for loading files oracle cannot load directly. The most common examples would be uncompress or decrypt, but you can of course use more complex scripts and strip a footer on the fly, or even include FTP as part of your external table load. See my other article no more unix scripts in oracle 11.2 !! for a few other, slightly unusual samples.
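
For example, a footer-stripping preprocessor could be as small as the sketch below (the script name is made up and it assumes the incoming file is a plain ascii file whose last line is a trailer record; for a compressed file you would pipe gunzip -c into sed):

#!/bin/bash
# run_strip_footer.sh - prints the location file without its last (footer) line to stdout
/bin/sed '$d' $1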

Below is a similar external table to the one above; the only difference is that it loads in parallel (4 identical compressed files with no header).

CREATE TABLE JIRI.SAMPLE_01_EXT
(
CUSTOMER_ID     NUMBER(6),
CUSTOMER_NAME   VARCHAR2(60),
CUSTOMER_SSN    VARCHAR2(60)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY dir_sample_data
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE
PREPROCESSOR dir_sample_exec:'run_my_unzip.sh'
SKIP 0
LOGFILE dir_sample_log:'sample.log'
BADFILE dir_sample_bad:'sample.bad'
NODISCARDFILE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
MISSING FIELD VALUES ARE NULL
)
LOCATION (dir_sample_data:'sample1.dat.gz', 'sample2.dat.gz', 'sample3.dat.gz','sample4.dat.gz')
)
REJECT LIMIT 0
PARALLEL 4
NOMONITORING;

drop me a note if you have any questions


5 thoughts on “what’s new in Oracle 11g R2 – external tables preprocessor”


  1. Allen Brumm says:

    OK, the real reason why the preprocessor facility was added to external tables was to be able to read compressed files “on the fly”. You may ask, why couldn’t this be done with named pipes and specifying the named pipe in the external table LOCATION clause? It is a hassle when doing a parallel query from an external table. It is impossible when doing a parallel query from an external table on a RAC system. Named pipes do not work across node boundaries. That was the real motivation.
    However, if the user is clever, the preprocessor facility can provide a mechanism to access remote files (i.e. without having a global file system namespace.)

    Regards,

    - Allen

  2. jiri says:

    Thank you for the comment. It’s really hard to guess why Oracle did XYZ and what the intention behind it was…

    I view it a little bit differently. I don’t think there is any right or wrong here, simply because we do not know exactly what Oracle’s intention was.

    I really don’t think it was added only because of compressed files; compressed files are a great example of this functionality, but as you can see it is a very universal and powerful tool. You might have encrypted files you want to decrypt on the fly, you might have a footer you want to remove before the ETL starts… there might be many reasons. If it had been done only because of compression, Oracle would simply have included an option like READ_COMPRESSED to uncompress the data before the load using an internal compression library, which could run in parallel.

    It’s really hard to guess, but I think they force you to put a preprocessor script name in the DDL instead of arbitrary unix code/commands because the latter would open a huge security hole – users who can create an external table would be able to execute any unix command under the ORACLE UID.

  3. Allen Brumm says:

    I wasn’t guessing, I implemented it for the reason stated above.
    Regards,

    - Allen

  4. Bill S. says:

    Regarding external tables, is there a way to prevent Oracle from verifying the filename of the external table? I’ve created a preprocessor which generates test data based strictly on the filename (e.g. set3_days365_thread7of16 would generate 365 days of test data, sending every 16th record starting with record 7). I can experiment with many time spans and parallel threads just by how I create the external table in Oracle.

    My only limitation right now is finding a way to create empty files so Oracle will actually call my preprocessor. It seems to want to confirm the file existence before invoking the preprocessor. Currently I have to create a bunch of 0 byte files before making the call.

    Is there a way around that or a better method for doing super fast data loads of records generated on the fly?

  5. jiri says:

    that’s actually a pretty clever way to generate test data!

    yes, the file specified in the location has to exist. Small note: the file can be referenced in the preprocessor using $1 (on windows using %1)

    you can use UTL_FILE to create a dummy empty file and then query the external table. If you want to do it in a single pass in pure SQL and cannot create the empty file ahead of time, then one option is to use a PIPELINED function which first creates the empty file, then selects from the external table and returns the result.
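
    just to illustrate the UTL_FILE part, a minimal sketch could look like this (the directory and file names are only placeholders, and the directory needs a WRITE grant):

    DECLARE
      l_file UTL_FILE.FILE_TYPE;
    BEGIN
      -- create (or truncate) an empty placeholder file in an oracle directory
      l_file := UTL_FILE.FOPEN('DIR_SAMPLE_DATA', 'set3_days365_thread7of16', 'w');
      UTL_FILE.FCLOSE(l_file);
    END;
    /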

    this is not what you need, but shows how to create a file and wrap it in pipelined function http://jiri.wordpress.com/2009/03/18/oracle-fast-parallel-data-unload-into-ascii-files/
