Happy New Year

Believe it or not, even though I have not updated my blog for more than three years, there has been a steady stream of visitors – 2,800 a month, to be more precise. 111,000 of you visited my blog in the past 5 years. Thank you for reading my blog and for the comments you posted.

I have to apologize for the lack of new posts. A lot has changed since July 2010 when I published my last post. I still spend a little bit of my time with Oracle, but for the most part I am diving deeply into the Hadoop ecosystem. I promise to share some gotchas soon ;-)

Oracle loves Hadoop – integrate Hadoop seamlessly into Oracle

Ever wanted to execute Hadoop scripts from the Oracle environment and join the results to any Oracle table? It’s easier than you think: there is no special tool required, there is no transient (temporary) file created, and there is no ETL! This can all be executed on the fly by any user, anytime. Why would you want to do it? There can be many reasons; for example, you have your customer data stored in Oracle and want to join it to very large call files or web logs generated every day.


Before I explain the trick, let me show you a sample screenshot of what I mean by “seamlessly”. The sample below is a query (in this case executed in DataStudio, but it can be pretty much any client you use to connect to Oracle – SQLPLUS, TOAD, SAS, MS ACCESS, … you name it). As you can see, I run a very simple query which in the background goes to a remote server, executes a Hadoop script, and returns the result back to Oracle. The user does not need to know that the data resides on a non-Oracle server (in this case Hadoop). Please note the Hadoop script is executed on the fly; this is not a pre-executed script (it runs every time you run the query).

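In plain SQL, the query on the screenshot boils down to something like this (MYHADOOP_EXT is the external table created later in this post):

SELECT script_output FROM jiri.myhadoop_ext;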

What you need

I used a sample VMware Hadoop image, plus Oracle 11.2 Enterprise Edition on RedHat (this is really nothing special, just a standard Oracle install). That’s all you need. Yup, it’s that simple.


Unix setup

No, don’t worry, there is really no special setting required. The only thing you should do is exchange public keys between the Hadoop and RedHat servers so that ssh won’t require a password. The how-to is very simple.
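
A minimal sketch of the key exchange, assuming the Hadoop VM is reachable as guest@192.168.149.132 (the same account and host used in the scripts below):

ssh-keygen -t rsa                    # accept the defaults, empty passphrase
ssh-copy-id guest@192.168.149.132    # copies your public key to the Hadoop VM
ssh guest@192.168.149.132 hostname   # should now connect without a password prompt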

Then create four Unix directories. You can create just one if you wish; I try to keep things organized.

/home/jiri/Desktop/sample/data    - this will hold dummy empty file
/home/jiri/Desktop/sample/log     - this will hold oracle logs
/home/jiri/Desktop/sample/bad     - this will hold oracle reject records
/home/jiri/Desktop/sample/execute - this will hold hadoop scripts (wrapped in sh)
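
If you keep the same paths, a single mkdir call creates all four:

mkdir -p /home/jiri/Desktop/sample/{data,log,bad,execute}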

I use the sample Hadoop script for calculating Pi; this script is part of the VMware image. To execute it, I create two Unix scripts on my RedHat server. As you can see, one script is basically a simple wrapper and the other calls the Hadoop script over ssh. Take these as simple examples; there are many ways to improve them and make them production ready.

/home/jiri/Desktop/sample/execute/run_hadoop.sh

#!/bin/bash
# wrapper called by the Oracle preprocessor: run the ssh script and discard stderr
/home/jiri/Desktop/sample/execute/hadoop_script.sh 2>/dev/null

/home/jiri/Desktop/sample/execute/hadoop_script.sh – please change the IP address to your Hadoop server’s IP or hostname.

#!/bin/bash
# run the sample Pi estimator on the Hadoop VM; its stdout comes back over ssh
/usr/bin/ssh guest@192.168.149.132 hadoop jar hadoop-examples.jar pi 4 10000

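Before wiring anything into Oracle, it is worth sanity-checking the chain from the shell; a minimal check, assuming the paths above (also make both scripts executable, otherwise the preprocessor cannot run them):

chmod +x /home/jiri/Desktop/sample/execute/run_hadoop.sh /home/jiri/Desktop/sample/execute/hadoop_script.sh
/home/jiri/Desktop/sample/execute/run_hadoop.sh   # stdout should contain the estimated value of Pi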

The last step is to create a dummy empty file (it’s really just an empty file; you can reuse this one dummy file for all Hadoop scripts).

/home/jiri/Desktop/sample/data/dummy.dat

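A single touch command is enough to create it:

touch /home/jiri/Desktop/sample/data/dummy.dat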

Oracle setup

There is really not much to set up. I will use external tables with a preprocessor (a new 11.2 feature). I wrote about this feature in two older posts, you can find them here and here.

First, let me set up the Oracle directories (log in to Oracle as a DBA or a user with grants to create directories).

 CREATE DIRECTORY dir_sample_data as '/home/jiri/Desktop/sample/data';
 CREATE DIRECTORY dir_sample_log  as '/home/jiri/Desktop/sample/log';
 CREATE DIRECTORY dir_sample_bad  as '/home/jiri/Desktop/sample/bad';
 CREATE DIRECTORY dir_sample_exec as '/home/jiri/Desktop/sample/execute';

Then I need to grant privileges on these directories to the user who will run the Hadoop query (in my case, Oracle user JIRI).

 GRANT READ        on DIRECTORY dir_sample_data to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_log  to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_bad  to JIRI;
 GRANT EXECUTE     on DIRECTORY dir_sample_exec to JIRI;

Now log in as a normal Oracle user (in my case JIRI) and create the external table.

CREATE TABLE JIRI.MYHADOOP_EXT
 (
 SCRIPT_OUTPUT   VARCHAR2(2000)
 )
 ORGANIZATION EXTERNAL
 ( TYPE ORACLE_LOADER
   DEFAULT DIRECTORY dir_sample_data
   ACCESS PARAMETERS
   ( RECORDS DELIMITED BY NEWLINE
     PREPROCESSOR dir_sample_exec:'run_hadoop.sh'
     SKIP 0
     LOGFILE dir_sample_log:'dummy.log'
     BADFILE dir_sample_bad:'dummy.bad'
     NODISCARDFILE
     FIELDS TERMINATED BY '~' OPTIONALLY ENCLOSED BY '~'
     MISSING FIELD VALUES ARE NULL
   )
   LOCATION (dir_sample_data:'dummy.dat')
 )
 REJECT LIMIT 0
 NOPARALLEL
 NOMONITORING;

That’s it. Now every time user JIRI runs a query which uses the MYHADOOP_EXT table, Oracle executes the Unix scripts, which run the Hadoop script on the remote server and return its stdout back to the Oracle user.
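
A minimal sketch of what this looks like in practice (the JIRI.CUSTOMERS table and its CUSTOMER_ID column are hypothetical, just to illustrate mixing the Hadoop output with ordinary Oracle data):

-- every execution runs the Hadoop job and streams its stdout back as rows
SELECT script_output FROM jiri.myhadoop_ext;

-- hypothetical join of the Hadoop output with a regular Oracle table
SELECT c.customer_id, h.script_output
  FROM jiri.customers c
 CROSS JOIN jiri.myhadoop_ext h;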

As you can see, the Oracle external table preprocessor is a very powerful tool; you can use it not only during ETL (the Oracle documentation suggests an unzip example), but for pretty much any Unix-level manipulation.


Query CLOB or BLOB from an Oracle table across a DB-LINK with a simple view – Solution 2

I will continue my two-part post with a second, slightly twisted solution. I did not try to optimize this for speed; it’s a simple proof of concept. This solution requires NO staging tables, but you need a tiny bit more grants on the remote server (you need to be able to create a function, a type and a view).

Please read Solution 1 before you continue with Solution 2; it explains why you can get the error ORA-22992: cannot use LOB locators selected from remote tables.


SOLUTION 2

This is a bit of a twisted solution, but it works. Basically, you first convert the LOB (in this example I refer to CLOB; BLOB is similar, but you will probably not use varchar2) into multiple varchar2(4000) chunks on the remote server (for example, a CLOB of size 8100 becomes 3 chunks of varchar2(4000)), then read this data across the db-link, and finally combine it back into a CLOB on the local server. This can all be done on the fly without any staging tables.


1. REMOTE SERVER

Let’s first define a table on the remote server and load it with some data.

create table RemoteTableWithClob (myid number, myclob clob);

insert into RemoteTableWithClob (myid, myclob) values (1, 'This is small test');
insert into RemoteTableWithClob (myid, myclob) values (2, 'Another Test');
insert into RemoteTableWithClob (myid, myclob) values (3, 'One more test');
commit;

Then we need to create a function that converts the CLOB to multiple varchar2 values and hide it behind a view. Please note I tried to use a collection here, but you cannot query a collection across a db-link; you would face the error ORA-22804: remote operations not permitted on object tables or user-defined type columns. So instead I convert the CLOB into what is basically a multi-row view (the number of records is variable and depends on the size of the LOB). This is very simple.

CREATE TYPE object_row_type AS OBJECT (
 MYORDER            NUMBER,
 MYID               NUMBER,
 MYCLOB_AS_VARCHAR  VARCHAR2(4000)    );

CREATE TYPE object_table_type AS TABLE OF object_row_type;

-- convert CLOB to multi-row varchar2
CREATE OR REPLACE FUNCTION CONVERT_CLOB_TO_VARCHAR RETURN object_table_type PIPELINED IS
PRAGMA AUTONOMOUS_TRANSACTION;

v_clob_length  number;
v_loops        number;
v_varchar_size number := 4000;

BEGIN

 FOR cur IN (SELECT myid, myclob from RemoteTableWithClob)
 LOOP
   v_clob_length := dbms_lob.getlength (cur.MyClob);
   v_loops := trunc(v_clob_length/v_varchar_size) + sign (mod(v_clob_length,v_varchar_size ) ) - 1;

   FOR i IN 0..v_loops
   LOOP
     PIPE ROW(object_row_type(i + 1, cur.myid, dbms_lob.substr( cur.MyClob, v_varchar_size, v_varchar_size * i + 1 )));
   END LOOP;

 END LOOP;
 COMMIT;
 RETURN;
END;

Now the view on the remote server:

-- create view to show the clob converted to varchar
-- please notice there is important column MYORDER (order of individual varchar chunks)
CREATE VIEW myRemoteData as
SELECT * FROM TABLE(CONVERT_CLOB_TO_VARCHAR) a;

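A quick way to check the remote side on its own (run directly on the remote server, before involving the database link):

SELECT myid, myorder, myclob_as_varchar
  FROM myRemoteData
 ORDER BY myid, myorder;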

2. LOCAL SERVER

On the local server, I read the records and convert them to a collection (this can be done in a query using CAST + MULTISET), then I use a function to convert this collection back to a CLOB.


CREATE PUBLIC DATABASE LINK RemoteServer  USING 'remoteserver';
CREATE OR REPLACE TYPE myTableType  as table of varchar2(4000);

CREATE OR REPLACE FUNCTION JIRI.F_VARCHAR_TO_CLOB (input_table_of_varchar myTableType) RETURN CLOB IS
PRAGMA AUTONOMOUS_TRANSACTION;
v_clob  clob;
BEGIN
FOR i IN 1..input_table_of_varchar.COUNT
LOOP
 -- the concatenation below could be probably done using DBMS_LOB.WRITEAPPEND :-)
 v_clob := v_clob || input_table_of_varchar(i);
END LOOP;
RETURN v_clob;
END;

And here is the final view on the local server:

-- the view first converts separate varchar2 records to a collection
-- then function is used to convert the collection to CLOB
CREATE OR REPLACE VIEW myRemoteData as
SELECT a.myid,
       F_VARCHAR_TO_CLOB( CAST(MULTISET(SELECT b.MYCLOB_AS_VARCHAR
                                          FROM myRemoteData@remoteserver b
                                         WHERE a.MYID = b.MYID
                                      ORDER BY MYORDER   ) as myTableType)
                        ) myClob
FROM REMOTETABLEWITHCLOB@remoteserver a;

As you can see, the process is very simple. I tested it with a CLOB of more than 100,000 characters on my tiny VMware instance and got pretty decent timing. Please take the code above as a proof of concept, not as a tuned, production-ready solution.
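
If you want to verify the re-assembly on the local server, a simple check is to compare lengths; each row should report the length of the original remote CLOB:

SELECT myid, dbms_lob.getlength(myclob) AS clob_length
  FROM myRemoteData;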
