arundhaj

regression towards the datascience

Pig script to process a CSV file with quotes and multiline records

 

When writing a Pig script, we usually load a CSV file with PigStorage.

Consider a sample CSV file in the following format.

2,Loading successfull,2014-09-25
3,Loading successfull,2014-09-25
4,Loading successfull,2014-09-25

This file can be loaded as follows:

logs = LOAD 'log_folder/log_file.csv' USING PigStorage(',') AS (id: long, message: chararray, timestamp: chararray);

However, I had a CSV file in which every field was wrapped in double quotes and a single record could span multiple lines, in the format shown below.

"2","Loading
successfull","2014-09-25"
"3","Loading
successfull","2014-09-25"
"4","Loading
successfull","2014-09-25"
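To see why PigStorage is not enough here, it helps to try it on the quoted file. PigStorage reads the input one line at a time and splits on the delimiter without interpreting quotes, so each record above is broken into two rows and the quote characters stay in the data. A minimal sketch, reusing the path and schema from the first example (the alias name broken is only for illustration):

```pig
-- Naive attempt, shown only to illustrate the problem.
-- PigStorage splits each physical line on ',' and keeps the quotes as data.
broken = LOAD 'log_folder/log_file.csv' USING PigStorage(',') AS (id: long, message: chararray, timestamp: chararray);

-- Expect twice as many rows as records; values that cannot be cast
-- to the declared type (such as "2" with its quotes) typically come back as null.
DUMP broken;
```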

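This sort of CSV file can be loaded using org.apache.pig.piggybank.storage.CSVExcelStorage from Piggybank. The 'YES_MULTILINE' option tells the loader to allow line breaks inside quoted fields.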

logs = LOAD 'log_folder/log_file.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (id: long, message: chararray, timestamp: chararray);
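CSVExcelStorage lives in the Piggybank jar, which usually has to be registered before the loader can be used. A minimal sketch of the complete script, assuming piggybank.jar sits in the directory the script is run from (the jar path is an assumption, adjust it to your installation):

```pig
-- Assumption: piggybank.jar is in the current directory; use the full path otherwise.
REGISTER piggybank.jar;

-- ',' is the field delimiter; 'YES_MULTILINE' allows quoted fields to contain line breaks.
logs = LOAD 'log_folder/log_file.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (id: long, message: chararray, timestamp: chararray);

-- Print the records to confirm that each one now has three fields.
DUMP logs;
```

The message field keeps the line break that was inside the quotes, so it may need cleaning up before further processing.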

Hope this helps.
