Maciej Puchalski's profile

F-One Data Integration, ETL
MS SQL Server, SSIS
 
Keywords: Data Integration, Data Warehouse
The problem:
Even a small bank has a very complex operating environment and many systems supporting its operations.
At the same time, numerous reporting needs arise at the management levels.
The problem is to collect data from all these systems to satisfy those reporting requirements, especially when some systems hold different versions of the same data.
The solution:
The F-One project focuses on gathering correct and complete data from various systems, both local and remote, in one place (the integration phase), thus building an Operational Data Store (a relational database holding all the integrated data).
On top of the ODS we build a data warehouse with many data marts, as needed for top-management reporting.
Low-level operational reporting is done directly against the ODS.
Team:
Project Owner (K.J.)
Project Manager/Main Analyst (P.W.)
Analysts/Consultants (A.K., R.F., A.M., …)
Main Developer (M.K.)
Additional Developers (A.K., J.G., …)
My role: Consultant, Developer
Technologies:
- Microsoft SQL Server 2008 R2
- Integration Services (SSIS)
 
My tasks:
(1) Consultancy on gathering data from existing systems, especially those based on Oracle databases.
(2) Consultancy on project organization problems, e.g. processing bad data (what to do with data that cannot be imported into the F-One database for various reasons: bad format, out-of-range values, no matching parent record, etc.).
(3) SSIS packages for importing data from CSV files into the F-One database.
(4) SSIS packages for extracting data from other systems (based on Oracle databases) and delivering CSV files ready for import.

Related projects:
(1) - a side system that gathers data about partners (customers) from a few other systems and builds a consistent partner database, which also delivers data to F-One
(2) - a follow-up to the F-One integration, but listed separately for clarity
(3) - procedures generating data to satisfy legally mandated reporting to the national Banking Guarantee Fund
 

Task assigned: Extract data from other systems (based on Oracle databases) and deliver CSV files ready for import.
My work:
- designed the process of controlled (configurable, calendar-based) data extraction
- created a job configuration calendar (an Oracle table)
- created SSIS packages to do the extraction
- configured a SQL Agent job to call these packages
  
  
CREATE TABLE AGENT.TB_JOB_CALENDAR
(
  JOB_NAME        VARCHAR2(30 BYTE)    NOT NULL,
  JOB_INDEX       VARCHAR2(30 BYTE)    NOT NULL,
  RUN_OPTIONS     VARCHAR2(60 BYTE),
  RUN_PARAMETERS  VARCHAR2(2000 BYTE),
  EXEC_PLAN       DATE                 NOT NULL,
  EXEC_STARTED    DATE,
  EXEC_FINISHED   DATE,
  EXEC_RESULT     VARCHAR2(2000 BYTE)
) ...
The primary key is (JOB_NAME, JOB_INDEX).
JOB_NAME is a foreign key to TB_JOB (where jobs are described in general terms).
JOB_INDEX is the unique identifier of an individual job execution; I encode this field using the date plus a supplemental symbol (e.g. 20120604C).
RUN_PARAMETERS is used inside the SSIS package, which parses parameter values from this XML-encoded string.
EXEC_PLAN is the date and time when the job should be executed by the package.
Sample records:

JOB_NAME:       F1_IMP_FILEGEN
JOB_INDEX:      20120604C
RUN_OPTIONS:    DLR,FBPL,LEO,WEB
RUN_PARAMETERS: <param name="data_od" value="2012-06-02"/><param name="data_do" value="2012-06-04"/><param name="filedate" value="2012-06-04"/><param name="run_exports" value="DLR,FBPL,LEO,WEB"/>
EXEC_PLAN:      2012-06-05 06:06:16
EXEC_STARTED:   2012-06-05 06:21:02
EXEC_FINISHED:  2012-06-05 06:24:28
EXEC_RESULT:    OK.

JOB_NAME:       F1_IMP_FILEGEN
JOB_INDEX:      20120604P
RUN_OPTIONS:    IPAR
RUN_PARAMETERS: <param name="data_od" value="2012-06-02"/><param name="data_do" value="2012-06-04"/><param name="filedate" value="2012-06-04"/><param name="run_exports" value="IPAR"/>
EXEC_PLAN:      2012-06-05 06:45:20
EXEC_STARTED:   2012-06-05 07:01:01
EXEC_FINISHED:  2012-06-05 07:02:23
EXEC_RESULT:    OK.
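As an aside, such an XML-encoded parameter string can also be unpacked directly in Oracle SQL. The sketch below is only an illustration (in the project the parsing is done inside the SSIS package); the <params> wrapper is my addition, since the stored fragments have no single root element:

-- Sketch: unpack RUN_PARAMETERS into name/value pairs.
SELECT c.JOB_NAME,
       c.JOB_INDEX,
       p.PARAM_NAME,
       p.PARAM_VALUE
  FROM AGENT.TB_JOB_CALENDAR c,
       XMLTABLE('/params/param'
                PASSING XMLTYPE('<params>' || c.RUN_PARAMETERS || '</params>')
                COLUMNS PARAM_NAME  VARCHAR2(30)  PATH '@name',
                        PARAM_VALUE VARCHAR2(200) PATH '@value') p
 WHERE c.JOB_INDEX = '20120604C';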
 
There is a job on SQL Server Agent calling this package; it runs on workdays and at the end of the month, every 20 minutes from 6:01 to 10:05.
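Such a schedule could be scripted in T-SQL roughly as follows (a sketch under assumptions: the job name is invented, the actual job was configured in the usual SQL Agent tooling, and the end-of-month runs would need a second, monthly schedule):

-- Sketch: weekday schedule, every 20 minutes, 6:01-10:05.
EXEC msdb.dbo.sp_add_jobschedule
     @job_name               = N'F1_IMP_FILEGEN',  -- assumed job name
     @name                   = N'Workdays every 20 min',
     @enabled                = 1,
     @freq_type              = 8,      -- weekly
     @freq_interval          = 62,     -- Mon..Fri (2+4+8+16+32)
     @freq_recurrence_factor = 1,
     @freq_subday_type       = 4,      -- repeat in minutes
     @freq_subday_interval   = 20,
     @active_start_time      = 060100, -- 6:01
     @active_end_time        = 100500; -- 10:05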
The first thing the package does is check its configuration in AGENT.TB_JOB_CALENDAR.
If it finds "missed jobs" (records that were never executed, although their EXEC_PLAN date is earlier than today), it issues a warning, displayed to the administrator(s) via a separate system (that I also created).
If there are fewer than 10 remaining records (with a future EXEC_PLAN date) in TB_JOB_CALENDAR, it issues another warning that the job configuration needs to be filled in for the coming days.
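In SQL terms, these two checks boil down to queries along the following lines (a sketch only; the exact criteria inside the package may differ, e.g. whether "never executed" tests EXEC_STARTED or EXEC_RESULT):

-- Sketch: "missed jobs" - planned before today, never executed.
SELECT JOB_NAME, JOB_INDEX, EXEC_PLAN
  FROM AGENT.TB_JOB_CALENDAR
 WHERE EXEC_PLAN < TRUNC(SYSDATE)
   AND EXEC_STARTED IS NULL;

-- Sketch: count future records; warn when fewer than 10 remain.
SELECT COUNT(*) AS REMAINING
  FROM AGENT.TB_JOB_CALENDAR
 WHERE EXEC_PLAN > SYSDATE;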
We check the status of IntegraPartner to see whether it is ready to be processed at the moment (and set a variable in the package).
Then we get the current jobs, i.e. those that have EXEC_PLAN between midnight and the current date-time and a null EXEC_RESULT, and process them in a foreach loop.
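A sketch of that selection (the package additionally skips records already marked as started):

-- Sketch: jobs due today that have not been started or finished yet.
SELECT JOB_NAME, JOB_INDEX, RUN_OPTIONS, RUN_PARAMETERS
  FROM AGENT.TB_JOB_CALENDAR
 WHERE EXEC_PLAN BETWEEN TRUNC(SYSDATE) AND SYSDATE
   AND EXEC_STARTED IS NULL
   AND EXEC_RESULT IS NULL
 ORDER BY EXEC_PLAN;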
If the current job contains the parameter for processing and exporting IntegraPartner data, that part is done only if we previously got the "ready" status. If not, the package finishes (waits), and when SQL Agent runs the schedule again in 20 minutes, we get another try at processing IntegraPartner.
We mark the current job as "started", so it is not called again before it finishes.
We call the other SSIS packages that do exports in the different data areas conditionally, depending on the configuration of the job (parameters read from TB_JOB_CALENDAR.RUN_PARAMETERS).
After successful completion of all the nested packages, we write "job done" to the relevant record in TB_JOB_CALENDAR, so the job does not get started again.
When an error occurs, it is caught by the OnError event handler, which marks the job as "error", issues a warning to the administrator and also sends an e-mail notice.
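The status bookkeeping around each job corresponds to updates like these (a sketch; the bind variables stand for values the package supplies, and the exact EXEC_RESULT texts are illustrative):

-- Sketch: mark the job as started, so it is not picked up again.
UPDATE AGENT.TB_JOB_CALENDAR
   SET EXEC_STARTED = SYSDATE
 WHERE JOB_NAME = :job_name
   AND JOB_INDEX = :job_index;

-- Sketch: on success, record completion ("job done").
UPDATE AGENT.TB_JOB_CALENDAR
   SET EXEC_FINISHED = SYSDATE,
       EXEC_RESULT   = 'OK.'
 WHERE JOB_NAME = :job_name
   AND JOB_INDEX = :job_index;

-- Sketch: the OnError handler marks the job as failed.
UPDATE AGENT.TB_JOB_CALENDAR
   SET EXEC_FINISHED = SYSDATE,
       EXEC_RESULT   = 'ERROR: ' || :error_message
 WHERE JOB_NAME = :job_name
   AND JOB_INDEX = :job_index;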
 
One of the nested packages is special. It does not export data from a database; instead, it downloads a file from an http address.
It does not contain any Data Flow, just a script task that uses package variables as parameters (such as the job date) and reads connection parameters from the database to download the proper file.
The parameters from the database are necessary to allow management of the user/password required to connect to the http proxy, which is subject to a password policy (hence must be changed periodically). You can see the code listed as a separate project with an open/free license: "freeCode, VB.NET, get file over http proxy".
Another application was created just to manage this technical web access account; it saves this configuration in the database (so it can be read by the SSIS package).
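For illustration, the stored configuration might look roughly like this (a purely hypothetical shape; the real table and column names differ):

-- Hypothetical configuration table for the technical web access account.
CREATE TABLE AGENT.TB_HTTP_CONFIG
(
  CONFIG_KEY   VARCHAR2(30 BYTE)  NOT NULL,  -- e.g. PROXY_HOST, PROXY_USER
  CONFIG_VALUE VARCHAR2(200 BYTE),
  CHANGED_ON   DATE
);

-- The SSIS script task reads the current values at run time:
SELECT CONFIG_KEY, CONFIG_VALUE
  FROM AGENT.TB_HTTP_CONFIG;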
 
There are actually two packages for each CSV file imported: the first one simply gets the data from the file into a helper import database (IMP) and is usually trivial, with only basic format transformations;
the second one transforms the data and puts it into the ODS database (the ETLT model).
The package I developed was a complex one, with 24 transformations in the main Data Flow.
The file BUC690 contains a kind of "booking protocol".
The control flow is shown in the first picture below. The next two pictures present the main data flow, also described here:

First, I take the account number, look it up to find the matching AccountId, and add it as a derived column, potentially with a null value (on NoMatch).
Second, I do a similar routine to find the DealerId based on the dealer code (null when the transaction did not concern any dealer).
Next, if the transaction is in a foreign currency, I find the exchange rate and calculate the value in the local currency.
Then some other checks are done for different types of transactions (the data stream is conditionally split).
There are cases when a given type of transaction has no matching record for a critical attribute; then a warning is issued to the administrator.
Finally, the merged streams are saved into the destination table (in the ODS database).
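Expressed as set-based SQL, the core of that data flow is roughly equivalent to the sketch below. All table and column names are illustrative, as is 'PLN' as the local currency; the real work is done by SSIS lookup, derived column, conditional split and union transformations:

-- Sketch of the main lookups; every name here is an assumption.
SELECT t.*,
       a.ACCOUNT_ID,                       -- NULL when no match (NoMatch)
       d.DEALER_ID,                        -- NULL when no dealer involved
       CASE
         WHEN t.CURRENCY = 'PLN' THEN t.AMOUNT  -- assumed local currency
         ELSE t.AMOUNT * r.EXCHANGE_RATE        -- value in local currency
       END AS AMOUNT_LOCAL
  FROM IMP.BUC690 t
  LEFT JOIN ODS.ACCOUNT a  ON a.ACCOUNT_NO  = t.ACCOUNT_NO
  LEFT JOIN ODS.DEALER  d  ON d.DEALER_CODE = t.DEALER_CODE
  LEFT JOIN ODS.EXCHANGE_RATE r
         ON r.CURRENCY  = t.CURRENCY
        AND r.RATE_DATE = t.BOOKING_DATE;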