DRY HiveQL

DRY (don’t repeat yourself) is one of the fundamental principles of software engineering. The main idea is to avoid duplicating business/processing logic throughout the code. However, I rarely see it being applied when writing SQL queries; making it difficult to understand and maintain them.

Below are few tips on making HiveQL DRY.

Quick Summary

  1. Use “Macro” to express transformation logic that can be easily expressed in SQL.
  2. Consider using “With Clause” to express subqueries.
  3. Use variables to define constants.

Macros

Macros allows to assign an alias to a reusable processing logic that can be expressed in SQL. In simple terms, its like defining a function purely in SQL (although it doesn’t operate that way. It will do an inline expansion but we don’t have to worry about it for now).

For instance, in the below table we have two duration fields where the duration value is expressed in different units (such as milliseconds, seconds, minutes, etc).

UUID duration1 duration2
1 10ms 20us
2 16s 20ms
3 5m 2us

Below is a typical bad way of writing the HiveQL for this. Its bad because we have duplicated (once for each field) the logic of converting the duration expressed as string to duration in seconds. Anytime we make changes to it we will have to make sure to update the logic everywhere in the code.

SELECT
    UUID, 
    CASE 
        WHEN duration1 like '%us' 
             THEN CAST(REPLACE(duration1, 'us', '') AS DOUBLE) / 1.0E6
        WHEN duration1 like '%ms' 
             THEN CAST(REPLACE(duration1, 'ms', '') AS DOUBLE) / 1000.0
        WHEN duration1 like '%s' 
             THEN CAST(REPLACE(duration1, 's', '') AS DOUBLE) 
        WHEN duration1 like '%m' 
             THEN CAST(REPLACE(duration1, 'm', '') AS DOUBLE) * 60
        ELSE NULL
    END as duration1_seconds,
    CASE 
        WHEN duration2 like '%us' 
             THEN CAST(REPLACE(duration2, 'us', '') AS DOUBLE) / 1.0E6
        WHEN duration2 like '%ms' 
             THEN CAST(REPLACE(duration2, 'ms', '') AS DOUBLE) / 1000.0
        WHEN duration2 like '%s' 
             THEN CAST(REPLACE(duration2, 's', '') AS DOUBLE) 
        WHEN duration2 like '%m' 
             THEN CAST(REPLACE(duration2, 'm', '') AS DOUBLE) * 60
        ELSE NULL
    END as duration2_seconds
FROM 
(
    SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
    UNION ALL
    SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
    UNION ALL
    SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
) A

DRY way to rewrite the above query is to utilize “macro”. We first define a macro “DURATION_IN_SECONDS” and use it convert all the duration fields as shown below.

-- define macro to convert duration string to duration in seconds
CREATE TEMPORARY MACRO DURATION_IN_SECONDS (t string)
    CASE 
        WHEN t like '%us' 
             THEN CAST(REPLACE(t, 'us', '') AS DOUBLE) / 1.0E6
        WHEN t like '%ms' 
             THEN CAST(REPLACE(t, 'ms', '') AS DOUBLE) / 1000.0
        WHEN t like '%s' 
             THEN CAST(REPLACE(t, 's', '') AS DOUBLE) 
        WHEN t like '%m' 
             THEN CAST(REPLACE(t, 'm', '') AS DOUBLE) * 60
        ELSE NULL
    END;

SELECT
    UUID, 
    -- use macro to convert first duration field
    DURATION_IN_SECONDS(duration1) duration1_seconds,
    -- use macro to convert second duration field
    DURATION_IN_SECONDS(duration2) duration2_seconds
FROM 
(
    SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
    UNION ALL
    SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
    UNION ALL
    SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
) A

With Clause

Below is an example of another bad query. In the SQL below, we use tableC to filter tableA and tableB and then join the two together. The logic on how to filter tableC itself has been duplicated.

SELECT *
FROM
(
    SELECT TableA.*
    FROM TableA 
    JOIN TableC
    ON (TableA.id = TableC.id)
    WHERE 
        TableA.datestr >= '2017-01-01'
        -- filters on table C 
        AND TableC.datestr >= '2017-01-01' 
        AND TableC.status != 0
) A 
JOIN
(
    SELECT TableB.*
    FROM TableB
    JOIN TableC
    ON (TableB.id = TableC.id)
    WHERE 
        TableB.datestr >= '2017-01-01'
        -- filters on table C 
        AND TableC.datestr >= '2017-01-01' 
        AND TableC.status != 0

) B 
On (A.id = B.id)

Here, using “With” clause can help us make this query DRY. We first express the logic of filtering table C and assign it an alias. Next we join tableA and tableB to this alias.

-- express logic to filter table C over here
WITH FilteredTableC AS 
    (
        SELECT * 
        FROM TableC
        WHERE 
            datestr >= '2017-01-01'
            AND status != 0
    )


SELECT *
FROM
(
    SELECT TableA.*
    FROM TableA
    JOIN FilteredTableC
    ON (TableA.id = FilteredTableC.id)
    WHERE 
        TableA.datestr >= '2017-01-01'
) A 
JOIN
(
    SELECT TableB.*
    FROM TableB
    JOIN FilteredTableC
    ON (TableB.id = FilteredTableC.id)
    WHERE 
        TableB.datestr >= '2017-01-01'
) B 
On (A.id = B.id)

“With” Clause not only helps with making a SQL DRY, but is also very useful in breaking a big sql involving many joins into smaller easy self summarizing chunks. For instance below is an example of a query that joins three tables together. Even in this simple query it becomes difficult to understand the goal as there is a list of filters that we are applying to different tables.

SELECT drivers.*, riders.*
FROM trips
JOIN drivers ON drivers.driver_id = trips.driver_id
JOIN riders ON riders.rider_id = trips.rider_id 
WHERE
    trips.datestr >= '2017-01-01'
    AND trips.status = 0
    AND trips.city = 'SF'
    AND drivers.joined >= '2017-01-01'
    AND drivers.status = 'active'
    AND riders.joined >= '2017-01-01'
    AND riders.name like 'XYZ%'

Using “With Clause” allows to rewrite the above query in much more legible way. Each table is separately filtered and assigned a readable alias which is then used in the main query.

WITH SuccessfulTrips as 
    (
        SELECT *
        FROM trips
        WHERE 
            trips.status = 0
            AND trips.datestr >= '2017-01-01'
    ),
    ActiveDrivers as 
    (
        SELECT * 
        FROM drivers
        WHERE 
            drivers.status = 'active'
            AND drivers.joined >= '2017-01-01'
    ),
    XYZRiders as 
    (
        SELECT * 
        FROM riders
        WHERE 
            riders.name like 'XYZ%'
            AND riders.joined >= '2017-01-01'    
    )

SELECT ActiveDrivers.*, XYZRiders.*
FROM SuccessfulTrips 
JOIN ActiveDrivers ON (SuccessfulTrips.driver_id = ActiveDrivers.driver_id)
JOIN XYZRiders ON (SuccessfulTrips.rider_id = XYZRiders.rider_id)

Variables

Often we use same constant values in multiple places. Instead of copying these constant values all over the place we can easily define a variable and use the variable.

SET start_date = '2017-01-01';
SET end_date = '2017-05-01';

SELECT A.*, B.*
FROM A.*
JOIN B.* ON (A.id = B.id)
WHERE
    A.datestr >= ${hiveconf:start_date} 
    AND A.datestr <= ${hiveconf:end_date}
    AND B.datestr >= ${hiveconf:start_date}
    AND B.datestr <= ${hiveconf:end_date}

There are few different options for setting variables in hive. Make sure to read comments on this stackoverflow post.

Advertisements
Posted in Big Data, General | Tagged , , | Leave a comment