Select Page

Trifacta Logo

So we have been trundling along with Trifacta and now have exported our recipe, and now we want to do cool stuff with it. One of the cool things that we can do now is use the logic from Trifacta and produce our own little mini parsing engine that we have complete control over. I thought I would quickly outline a few good places to start in our favorite language Python.

Getting Started

The goal with parsing data with a Trifecta recipe in Python is to get into memory a bunch of objects with the correct information attached.

This bit assumes you have already got the data in a variable from a file, s3 or some other similar source. Now let’s have a look at the Trifacta script to see what’s going on.

splitrows col: column1 on: '\r\n'

As we can see with this line the first thing Trifacta does with a file is to split it on lines into rows. This can be easily done in Python:

lines = file_body.split("\r\n")

This will store all the lines of the file in a new lines variable.
Once we have that the next thing that is commonly done in Trifacta is to get rid of any lines we don’t care about, or to keep only the lines we do care about. Here is an example of a bit of both happening:

keep row: matches([column1], `New client connection`)
delete row: matches([column1], `rejected from server`)

Here we are keeping any rows that contain ‘New client connection’ but not ‘rejected from server’. Again, this is not difficult in Python:

def parse_item(item):
    data = item["Data"]
    if "New client connection" in data and "rejected from server" not in data:
        first_parsed_lines.append(item)

Everything from here on in depends a bit on what the data set looks like. I will just show some examples of how to do common parsing here.

Date/Time

Depending on how you did things in Wrangler the script will describe time/date parsing something like this:

split col: column1 on: `T` before: `{time}`
rename col: column2 to: 'Date'
split col: column3 on: ` `
rename col: column1 to: 'Time'

We can see that Trifacta has an internal for ‘{time}’ which will match any strings that follow recognized date times. We can mimic this behavior in Python, though it makes for cleaner code if we already know the date format we will be parsing. The date/time field in this example looks like ‘2015-10-22T00:00:53,354’. To help us parse this we will use Python’s regex compiler/matcher.

date_pattern = re.compile('([0-9]{4}-[0-9]{2}-[0-9]{2})T([0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]*)(\s)',re.IGNORECASE) 
date_string = date_pattern.match(line_data) 
metadata_data['Date'] = date_string.group(1) 
metadata_data['Time'] = date_string.group(2) 

You can see here that this produces two outputs, one for time and one for date. Once we have the regex figured out parsing this is simple. Another thing that can help us with this parsing that Trifacta also does is to clip the already parsed section off the string so we don’t have to worry about it anymore. The syntax for Python is a little harder to read, but just as effective.

line_data = line_data[len(date_string.group(0)):]

The takes the length of the result string and chops that off the data we are parsing from. Be careful here if the data string was not at the front of the string as you may lose data.

Log4J Event Levels

Another common thing to parse off a file is the severity of an event in a log. As these are wrapped in square brackets it is a little easier than dates!
From:

extract col: column1 on: `[{upper}{4,6}]`

To:

event_pattern = re.compile('\[(\D{4,6})\]\s', re.IGNORECASE)

Emails

Again emails are easy to parse with:

user_pattern = re.compile(':(\w+@\w+)\s-\s')

Things to we can do now

As we now have this data in memory in python we can muck around with the objects as we would anything else in Python. One of my favorite things to do is to load all this nicely parsed data into DynamoDB which would be more difficult in Trifacta than Python. Another thing we can do now is some pattern matching or visualizations to have a wider view of the data.
Trifacta has given us an easy recipe to apply to the data, and Python is the easiest language to apply it with!

Hope this gives you the kickstart with your dataset

Coffee to Code – Tim Gray

Tim blogs about the sharp end of code and the languages we write in.

You can read Tim’s post, Exporting and Importing Trifacta Wrangler Recipes or all of Tim’s blogs here.

We run regular business intelligence courses in both Wellington and Auckland. Find out more here.

%d bloggers like this: