Thursday, April 30, 2015

Difference between tJava,tJavaRow and tJavaFlex component

There are three Java components in the Custom Code family: tJava, tJavaRow and tJavaFlex

  1. tJava
  • tJava component is used to integrate your custom Java code into a Talend program. 
  • It applies exclusively to the start part of the generated code of the subjob. 
  • It will be executed first but only once in the subjob. 
  • tJava component has no input or output data flow and is used as a separate subjob.  
  • Common use of tJava include setting global or context variables prior to the main data processing stages and printing logging messages.It is udes to display status message and variables.
    
     2.  tJavaRow
  • The tJavaRow code applies exclusively to the main part of the generated code of the subjob. 
  • The Java code inserted through the tJavaRow will be executed for each row. 
  • The tJavaRow component is used as an intermediate component and you are able to access the input flow and transform the data.

     3.  tJavaFlex  

The tJavaFlex has three Java code parts (start, main, end) that enable you to enter personalized code for different purposes. 
  • The Start code is executed prior to any rows being processed, so it is used to initialize the variables.The start part will be executed first but only once in the subjob. 
  • The Main code is executed for every row. You are able to access the input flow and modify the data. The source data is processed at runtime by the tJavaFlex
  • The End code is executed after all the rows have finished processing but only once in a job.


The tJavaFlex component is similar to the tJavaRow component, in that it is included into a flow. The difference between the two components is that the tJavaFlex component has pre and post processes that are performed before and after the individual rows are processed.

Difference between tMap and tJoin component

 tMap and tJoin component are used for mapping data, it shows how to map, join, and filter data from input to output in both batch and real-time modes.

The difference between tMap and tJoin are as follows:-

  • tMap accepts more than one input links i.e it has one Main and others are Lookup.
          tJoin accepts only two input i.e it has one Main and one Lookup link.
  • tMap accepts more than one Outputs links. 
          tJoin which can have a Main and Reject links.
  •  tMap supports more types of join model, includes Unique join, First join ,Inner join and Left Outer join .
           tJoin only support Unique join.  
  • tMap support three match model i.e Unique Match,First Match and All Matches.
     tJoin support only Unique Match.
  • tMap consist of multiple Lookup files.
          tJoin has only one Lookup file.
  • tMap consist of Expression Filter so that we can filter our data.
          tJoin this is not possible.
  • tMap we can use Expression Builder on the columns for complex transformation rules.
          tJoin this is not possible. Exact Match between the key is possible. 
  • tMap support Die on error option.
          tJoin this is not possible.
  • tMap take more space and time to load in the memory.
          tJoin take less time.

Wednesday, April 29, 2015

How to use tExtractRegexFields component in Talend

tExtractRegexFields component is used to extract multiple column from single column using regular expression.

This is Student Input File

Student_ID;Student_Email;Student_Age
101;megha@yahoo.com;21
102;leena@gmail.com;34
103;shailja@hotmail.in;22
104;anupama@twitter.org;19
105;ayushi@facebook.in;24

So firstly create a new job from Job Designs > Create Job.
  • Drag the schema of Student.csv from Metadata > File Delimited > Student.csv and drop it to the design work space and select tfileInputDelimited option from pop window.This Student.csv file is been taken as an input file.

Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your input file.
  • Then drag and drop the following components from the palette into the design workspace:-tExtractRegexFields and tLogRow.
  • Connect each component by right clicking and select Row > Main.




  •  Open the component properties of tExtractRegexFields to view the Basic Settings 
  1. Select Student_Email in "Field to split" field.
  2. Type the regular expression in the"Regex" field as "([a-z]*)@([a-z]*).([a-z]*)" is used to match the three parts of an Student_Email column :  Name, Domain and ID .
  3. Click on "Edit Schema" Button.



Drag Student_ID,Sudent_Age as it is from tFileInputDelimited_1(Input - Main) in the output panel of the tExtractRegexFields_1(Output) , click the (+) button to add three columns for the output schemaHere we want to split the Student_Email column into three columns in the output as Name, Domain and ID.


In the Basic settings of tLogRow components, select Table (print values in cells of a table) to show result in table format.

Atlast Run the Job.



In the result you can see that all the five rows are executed successfully and Student_Email column is split into multiple columns.

How to use tExtractDelimitedFields component in Talend

In the earlier post of tNormalize I have shown you how to split multivalued columns values to multiple records. Here In tExtractDelimitedFields  component I will show you, how to extract value of single column into multiple columns or how to generate multiple columns from single column values.

This is Suppliers Input File

Supplier_ID;Supplier_Name;Supplier_Products
101;Nick Pvt Ltd;Desktop,Laptop,Mobiles
102;Shiny Shipping;WiFi Routers,PC,Networking Wires
103;Michael Pvt Ltd;RAM,Hard-Disk,Servers
104;Rose Marry Ship Pvt Ltd;Pen Drive,DDR Memory,Tablet

105;Mark Pvt Ltd;Watch,Refrigerator,CD

In Supplier_Products column there are comma separated values now these values will be split into multiple columns.

So firstly create a new job from Job Designs > Create Job.
  • Drag the schema of Suppliers.csv from Metadata > File Delimited > Suppliers.csv and drop it to the design work space and select tfileInputDelimited option from pop window.This Suppliers.csv file is been taken as an input file.

Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your input file.
  • Then drag and drop the following components from the palette into the design workspace:-tExtractDelimitedFields and tLogRow.
  • Connect each component by right clicking and select Row > Main.




  •  Open the component properties of tExtractDelimitedFields to view the Basic Settings 
  1. Select Suppliers_Products in "Field to split" field.
  2. Type the "Field separator" as ","
  3. Click on "Edit Schema" Button.


In the output panel of the tExtractDelimitedFields_1(Output) , click the (+) button to add three columns for the output schemaHere we want to split the Suppliers_Product column into three columns in the output Product_1,Product_2 and Product_3.


In the Basic settings of tLogRow components, select Table (print values in cells of a table) to show result in table format.

Atlast Run the Job.


In the you can see that all the five rows are executed successfully and Suppliers_Product column is split into multiple columns named as  Product_1,Product_2 and Product_3.

How to use tAggregateRow component in Talend


tAggregateRow receives a input and aggregates it based on one or more columns. 

This is Orders Input File :-

Order_ID Customer_ID Employee_ID Order_Date Shipper_ID
11248
1 1 41655 3
11248
2 1 41656 1
11248
4 1 41657 4
11249
9 3 41658 1
11249
10 8 41659 1
11250
11 2 41660 1
11250
13 6 41661 3
11250
6 7 41662 2
11251
7 6 41663 2
11251
3 4 41664 4
11251
2 4 41665 2
11252
11 7 41666 1
11252
7 3 41667 2
11252
5 4 41668 1
11253
8 4 41669 4

Shippers Lookup File

Shipper_ID Shipper_Name Phone
1
Shiny Shipping 4133694044
2
Rose Marry Ship Pvt 5703453631
3
Nick Ltd 3038141672
4
Michle Ltd 7274882222
 

So firstly create a new job from Job Designs > Create Job.
Drag the schema of Orders.csv from Metadata > File Delimited > Orders.csv and drop it to the design work space and select tfileInputDelimited option from pop window.This Orders.csv file is been taken as an input file.
Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your Orders.csv file.
Then drag and drop the following components from the palette into the design workspace:-tAggregateRow,tMapt and LogRow.
Connect each component by right clicking and select Row > Main.

Now you can see that I have used a lookup file named as Shippers.csv
Drag the schema of Shippers.csv from Metadata > File Delimited > Shippers.csv and drop it to the design work space. Connect Shippers.csv FileInputDelimited by right clicking and select Row > Main.


Double-click  (tAggregateRow component) to set the properties. Click Edit schema and define the output schema. You can add as many columns as you need to hold the set operations results in the output flow.Here we want two columns Order_ID and Shipper_ID so drag it from Orders (Input - Main) to tAggregareRow_1(Output) column.
·     
  
In this example, we will calculate the how many Order_ID  per Shipper_ID. 

  • In the Group By select Shipper_ID as Output column and Shipper_ID as Input column positionThe first column mentioned as output column in the Group By table is the main for tAggregateRow.
  • In the Operations Field select Order_ID as Output column,select count as function and Order_ID as Input column position.


Now in tMap settings drop Shippers_ID from row 2 to row 3 and drag and drop Order_ID,Shippers_ID from row 2 column to output column and Shippers_Name from row 3 to output column so that these columns can appear in our output.


In the Basic settings of tLogRow components, select Table (print values in cells of a table) to show result in table format.
Atlast Run the Job.



Here you can see that Shipper_ID 1 has 6 Order_ID each shipper is counting how many orders are there and there names are displayed.

How to use tSplitRow component in Talend

In my previous post of tnormalize I have shown you how to split a multi valued attributes column into individual rows in this comma separated values are split into multiple rows but in 
tSplitRow it splits one row into multiple rows.

This is Student Input File

Student_ID ,Institution,Address,City,Country,Course 1,Course 2,Course 3
101,NIIT,722 Bur Oak Avenue ,Berlin,Germany,CS,EC,EE
102,AIIM,11 Collaroy  2093,Medrid,Spain,MBA,BBA,BCA
103,SRIT,223 Wellington  5011,London,UK,BSC,ME,MTECH

So firstly create a new job from Job Designs > Create Job.
Drag the schema of Student.csv from Metadata > File Delimited > Studentsplitrow.csv and drop it to the design work space and select tfileInputDelimited option from pop window.This Student.csv file is been taken as an input file.
Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your Student.csv file.
Then drag and drop the following components from the palette into the design workspace:-tSplitRow,tLogRow.
Connect each component by right clicking and select Row > Main.


Open the component properties of tSplitRow here we want different record for every Courses for every Student.So we will drag Institution_ID,Institution_Name from Student(Input-Main) to tSplitRow_1(Output) and add Institution_Address and Institution_Courses fields as output to tSplitRow component.


In the column mapping table of tSplitRow click the (+) sign to add mapping from source to target record. If the source input file has 3 records and 3 mappings are defined in the Columns Mapping table in tSplitRow component then the output will provide 9 records.  


  • On the first parameter line, write row1.Institution_ID in Institution_ID field. Type row1.Institution_Name in the Institution_Name field and row1.Address+" "+row1.City+" "+row1.Country in the Institution_Address field.Type row1.Course_1 in Institution_Courses field.
  • On the second parameter line, write row1.Institution_ID in Institution_ID field. Type row1.Institution_Name in the Institution_Name field and row1.Address+" "+row1.City+" "+row1.Country in the Institution_Address field.Type row1.Course_2 in Institution_Courses field.
  • On the third parameter line, write row1.Institution_ID in Institution_ID field. Type row1.Institution_Name in the Institution_Name field and row1.Address+" "+row1.City+" "+row1.Country in the Institution_Address field.Type row1.Course_3 in Institution_Courses field.
Here in Institution_Address field I have concatenated Address,City and Country Column together.
In the Basic settings of tLogRow components, select Table (print values in cells of a table) to show result in table format.

  • Atlast Run the Job.


Starting job how_to_use_tSplitRow_in_Talend at 13:54 29/04/2015.

[statistics] connecting to socket on port 3988
[statistics] connected
.--------------+----------------+----------------------------------+-------------------.
|                                      tLogRow_1                                       |
|=-------------+----------------+----------------------------------+------------------=|
|Institution_ID|Institution_Name|Institution_Address               |Institution_Courses|
|=-------------+----------------+----------------------------------+------------------=|
|101           |NIIT            |722 Bur Oak Avenue  Berlin Germany|CS                 |
|101           |NIIT            |722 Bur Oak Avenue  Berlin Germany|EC                 |
|101           |NIIT            |722 Bur Oak Avenue  Berlin Germany|EE                 |
|102           |AIIM            |11 Collaroy  2093 Medrid Spain    |MBA                |
|102           |AIIM            |11 Collaroy  2093 Medrid Spain    |BBA                |
|102           |AIIM            |11 Collaroy  2093 Medrid Spain    |BCA                |
|103           |SRIT            |223 Wellington  5011 London UK    |BSC                |
|103           |SRIT            |223 Wellington  5011 London UK    |ME                 |
|103           |SRIT            |223 Wellington  5011 London UK    |MTECH              |
'--------------+----------------+----------------------------------+-------------------'

[statistics] disconnected
Job how_to_use_tSplitRow_in_Talend ended at 13:54 29/04/2015. [exit code=0]

Here you can see that 3 records are split into 9 records each Institution has multiple courses.


Monday, April 27, 2015

How to use tReplace component in Talend

tReplace component is used to replace a value it carries out a search & replace operation in the input columns defined.

This is Input Customer File.

Cust_First_Name,Cust_Address,Cust_City,Cust_Country
Lee,722 Bur Oak Avenue ,Berlin,Germany
William,543 ANJOU ANJOU H1K 2P4,Mexico,Mexico
Nick,40  Toronto M8Z 3Z7,Mexico,Mexico
Jian,11 Collaroy  2093,London,UK
Ming,223 Wellington  5011,Berlin,Germany
Joseph,324 Tman Street   Stafford Heights ,Mannheim,Germany
Justin,90 Ruapehu Road   Ohakune ,Strasbourg,France
Ming,32 Wilfred road,Madrid,Spain
Lee,Williamstown Road,Marseille,France

Firstly create a new job from Job Designs > Create Job.
Drag the schema of Customer.csv from Metadata > File Delimited > Suppliers and drop it to the design work space and select tfileInputDelimited option from pop window.This Customer.csv file is been taken as an input file
Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your Customer.csv file.
Then drag and drop the following components from the palette into the design workspace:-tReplace,tLogRow.
Connect each component by right clicking and select Row > Main.



Open the component properties of tReplace first click on Edit Schema button so that the input is routed directly to the output. Select the columns you want to replace and drag it to the tReplace_1(Output) and press the OK button.
Now this process can be done by two ways so let us see both the process one by one.

  1. Simple Mode
Select the Simple mode check box click the plus(+) sign to add some lines to the parameters table.
  • On the first parameter line, select Cust_First_Name as InputColumn. Type "Lee" in the Search field, and "Leelon" in the Replace field.
  • On the second parameter line, select Cust_Address as InputColumn. Type "2" in the Search field, and "4" in the Replace field.
  • On the third parameter line, select again Cust_City as InputColumn. Type "Mexico" in the Search field, and "California" in the Replace field.
  • On the fourth paramater line, select Cust_Country as InputColumn. Type "Germany" in the Search field, and "UK" in the Replace field.

2. Advanced Mode

Select the Advanced mode check box click the plus(+) sign to add some lines to the parameters table.

  • On the first parameter line, select Cust_First_Name as Source. Type "[A-Za-z]*ee" in the Pattern field, and "Leelon" in the Replace field.This means that the word which ends by "ee" will be replaced by "Leelon"
  • On the second parameter line, select Cust_Address as Source. Type "2" in the Pattern field, and "4" in the Replace field.
  • On the third parameter line, select again Cust_City as Source. Type  "[A-Za-z]*ico" in the Pattern field, and "California" in the Replace field.This means that the word which ends by "ico" will be replaced by "California".
  • On the fourth paramater line, select Cust_Country as Source. Type "[A-Za-z]*many" in the Pattern field, and "UK" in the Replace field.This means that the word which ends by "many" will be replaced by "UK",


Running both the Mode will give the same result as shown in below screen.


Here you can see that in Cust_First_Name "Lee" is replaced by "Leelon" and the word ended by *ee is also replaced by "Leelon".In Cust_Address 2 is replaced by 4.In Cust_City "Mexico" is replaced by "California" and the word ended by *ico is also replaced by "California".In Cust_Country "Germany" is replaced by "UK" and the word ended by *many is also replaced by "UK".

Both process of Simple mode and Advanced mode are explained. tReplace is used to swap out one value for another based on text patterns.

How to use tFilterColumns component in Talend


tFilterColumns component is used for removing the unwanted columns from the job flow.We can do the same process from tMap but tFiltercolumns make is more explicit that we are filtering the columns from one flow to another.

This is our Shippers Input File .
Shipper_ID,Shipper_Name,Phone
1,Shiny Shipping,4133694044
2,Rose Marry Ship Pvt,5703453631
3,Nick Ltd,3038141672
4,Michle Ltd,7274882222

So firstly create a new job from Job Designs > Create Job.
Drag the schema of Shippers.csv from Metadata > File Delimited > Shippers and drop it to the design work space and select tfileInputDelimited option from pop window.This Shippers.csv file is been taken as an input file
Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your Shippers.csv file.
Then drag and drop the following components from the palette into the design workspace:-tFilterColumns,tLogRow.
Connect each component by right clicking and select Row > Main.


Then double click on tFilterColumns to open its component properties click on Edit Schema this window will open now there are three columns in Shippers Input you want two columns from input to be filtered so simply drag it from Shippers (Input-Main) to tFilterColumns_1(Output).Now we want only two columns Shipper_ID and Shipper_Name instead of three columns.



In the Basic settings of tLogRow components, select Table (print values in cells of a table) to show result in table format.
Atlast Run the Job.


Here you can see that we have three columns in our Input File and at result we have two columns that we want to filter.

How to use tSortRow component in Talend

This tutorial explains how to sort data, write it into a temporary file, and replace the source file with the temporary file.
It comprises of two Subjobs:

- sorting data in a temporary file,
- replacing the source file by that temporary file.


tSortRow component sorts input data based on one or several columns, by sort type and order.

This is Suppliers Input
SupplierID,SupplierName,ContactName,Address,City,PostalCode,Country
1,Debra,Roberts,1035 Gales Ave   Winston Salem Winston Salem 27103-4579,London,8605932890,UK
2,Elliot,BOLICK,4402 w avenida del sol   Glendale Glendale 85310-3915,California,2162256986,USA
3,Michael,Thompson,1869 GRAND VIEW DR   OAKLAND OAKLAND 94618-2339,Texas,5414000606,USA
4,JOE,Salamida,23731 Howard St. PO Box 333  Covelo Covelo 95428-0333,Tokyo,9198488887,Japan
5,Stephen,Schofield,6750 Beecher Road   Clayton Lenawee 49235-9655,Madrid,6317038390,Spain
6,Mark,chung,3470 Tilden St   PHILADELPHIA PHILADELPHIA 19129-1435,Saitama,4083996114,Japan
7,James,B Brown,3250 NE 1st Ave Apt. 918   MIami MIami 33137-4097,Texas,3128131305,USA
8,Albert,Linden,9000 Bay Hill Blvd Suite 300  Orlando Orlando 32819-4880,Birmingham,2252528310,UK
9,Carey,Liu,Roberts Communications 64 Commercial Street  Rochester Rochester 14614-1010,Sao Paulo,9703192376,Brazil
10,Enrique,Cohen,1026 County Rd. 112   Carbondale Carbondale 81623-9642,Sao Paulo,6578644245,Brazil

Firstly create a new job from Job Designs > Create Job.
Drag the schema of Suppliers.csv from Metadata > File Delimited > Suppliers and drop it to the design work space and select tfileInputDelimited option from pop window.This Suppliers.csv file is been taken as an input file.Or you can simply drag this component from the palette and double click on it to open the component properties and click [...] next to the File Name field to specify the path where you have created your Suppliers.csv file.
Then drag and drop the following components from the palette into the design workspace:-tSortRow,tfileOutputDelimited

Connect each component by right clicking and select Row > Main.



Then double click on tSortRow to open the component view .In Basic Settings define under Criteria Table by clicking (+) buton to add a line .Select the column you want to sort and select sort num or alpha option or asc or desc option based on your requirement.For example I have selected SupplierName,alpha and asc option.



Now double click on tfileOutputDelimited in the wizard, define the same path as for the Suppliers.csv file but name it as output_sortrow.csv.

Check the Include Header box to retrieve the column names.

Run the Job. Now will see the Job has created a new file named output_sortrow.csv containing the sorted data.Our purpose of the Job was to sort the source file and not to create a new one.

So here in next step we will see how to replace the source file by the new one.

Drag and drop a new component tFileCopy from the palette into the design workspace.
And connect by right click on tFileInputDelimitedand select Trigger > OnSubjobOk from the menu and drag a line to tFileCopy.



Open the component properties of tFileCopy now to copy the output_sortrow.csv file containing the sorted data,specify the file path in the File Name field .
In the Destination directory field specify the folder where you want to copy the file and select the file path of the Suppliers.csv source file.

To replace the source file with the sorted file, check the Rename box and write “Suppliers.csv”.
To delete the temporary file, click checkbox of Remove source file and tick check box of Replace Existing File and Create the Directory if it doesn’t exists.


At last Run the Job you will see that source file has been replaced with the temporary file.


Wednesday, April 22, 2015

How to use tFilterRow component in Talend


tFilterRow component is used to filter input rows by setting conditions on the selected columns.

This component uses simple condition and a regular expression to filter a list of records.This scenario will output two tables: the first output will list all records where Order_ID is equals to 11248 or Shipper_ID equals to "1" and the second output will list all rejected records. An error message for each rejected record will display in the same table to explain why such records has been rejected.

This is Orders Input

Order_ID,Customer_ID,Employee_ID,Order_Date,Shipper_ID
11248,1,1,41655,3
11248,2,1,41656,1
11248,4,1,41657,4
11249,9,3,41658,1
11249,10,8,41659,1
11250,11,2,41660,1
11250,13,6,41661,3
11250,6,7,41662,2
11251,7,6,41663,2
11251,3,4,41664,4
11251,2,4,41665,2
11252,11,7,41666,1
11252,7,3,41667,2
11252,5,4,41668,1
11253,8,4,41669,4

Create a new job then create metadata of Orders.CSV file drag it from Metadata > FileDelimited > Orders to design work sapce and select tFileIinputDelimited from pop upwindow.

Drop tFilterRow and tLogRow from the Palette on the design workspace.
Connect the Orders File Delimited to the tFilterRow, using a Row > Main link. 
Then, connect the tFilterRow to the tLogRow, using a Row > Filter link and rename it as Filter. 
Then again drop tLogRow from the Palette into the design workspace and rename it as Reject and then again connect the tFilterRow to the reject, using a Row > Reject link.

Open the tFilterRow component and define its properties.



Sync Columns to propogate the metadata.

  • In the Conditions table, fill the filtering parameters.
  • In InputColumn, select Order_ID, Function as Empty,Operator as Equals.
  • Or you can select other options from dropdown list such as greater than,lower based on your requirement .

  • In the Value column, type 11248 to filter .
  • Now if you want to use Use advanced mode then click check box and type in the following regular expression that includes ID to be searched---input_row.Shipper_ID.equals(1)
  • To combine both conditions (simple and advanced), select Or as logical operator for this example.You can also use And operator based on your requirement.
  • In the Basic settings of tLogRow components, select Table (print values in cells of a table) to show result in table format.

Atlast Run the job.


Here you can see that 8 rows are filtered and 7 rows are rejected.
The records which fullfill the criteria of Order_ID =11248 Or Shipper_ID = 1 they are in filtered records and rest of them are in rejected records.Each rejected record has a corresponding error message that explains the reason of rejection.