UPSERT in Hive (3 Step Process)


In this post we'll learn an efficient 3-step process for performing an UPSERT in Hive on a large table containing the entire history.
Just for the audience not aware of UPSERT - it is a combination of UPDATE and INSERT. If a table containing history data receives new data that needs to be inserted, as well as data that is an UPDATE to existing rows, then we have to perform an UPSERT operation to apply both.

Prerequisite – The table containing the history, being very large, should be partitioned. Partitioning is also a best practice for efficient data storage when keeping large volumes of data in any Big Data warehouse.

Business scenario – Let's take the example of clickstream data of a website, gathered from the browsers of visitors to the site. The site_view_hist table contains the click and page impression counts from different browsers, and the table is partitioned on hit_date (the date on which the visitor hit, or visited, the website).
Clicks – number of clicks (e.g. on advertisements displayed) made by a visitor on a website page.
Impressions – number of times the website pages or different sections were viewed by the visitor.

Problem statement - If we receive corrections to the number of clicks and impressions recorded by a browser, we need to update them in the history table and also insert any new records we received.
Let's dive into it:
In the history table, browser_name and hit_date form a composite key which remains constant, while we receive updates to the values of the clicks_count and impressions_count columns.
DDL of history table
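A minimal sketch of what this DDL might look like; the column types, the STRING hit_date and the ORC storage format are assumptions, while the column names and the hit_date partitioning come from the description above:

CREATE TABLE site_view_hist (
  browser_name      STRING,
  clicks_count      BIGINT,
  impressions_count BIGINT
)
PARTITIONED BY (hit_date STRING)  -- partitioned on hit date, as per the prerequisite above
STORED AS ORC;                    -- storage format is an assumption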

Data:

Now suppose we receive updated records for date 2016-01-01 for the firefox and chrome browsers, and we also receive a new record (iexplorer) for 2016-01-31. Let us store these new and updated records in the following raw table:
DDL of Raw table
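A similarly hedged sketch of the raw table, assuming it mirrors the history table's layout (the name site_view_raw appears later in the comments):

CREATE TABLE site_view_raw (
  browser_name      STRING,
  clicks_count      BIGINT,
  impressions_count BIGINT
)
PARTITIONED BY (hit_date STRING)
STORED AS ORC;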




Data

Now we need an UPSERT solution that updates the records of the site_view_hist table for hit_date 2016-01-01 and inserts the new record for 2016-01-31.
SOLUTION (3 STEPS):
To achieve this efficiently, we will use the following 3-step process:
Prep Step - First, fetch from the history table only those partitions which need to be updated. We create a temp table site_view_temp1 that contains the rows from history whose hit_date equals a hit_date present in the raw table.
This brings us all the hit_date partitions from the history table for which at least one updated record exists in the raw table, as sketched below.
NOTE - Instead of a table we can also create a view, for efficient processing and to save storage space.
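A minimal sketch of the prep query, assuming the raw table is named site_view_raw; a LEFT SEMI JOIN keeps only those history rows whose hit_date also appears in the raw table:

-- CTAS produces an unpartitioned temp table; hit_date is carried along as a normal column
CREATE TABLE site_view_temp1 AS
SELECT h.browser_name,
       h.clicks_count,
       h.impressions_count,
       h.hit_date
FROM site_view_hist h
LEFT SEMI JOIN site_view_raw r
  ON (h.hit_date = r.hit_date);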


Data of site_view_temp1 table:

Step 1 – From these fetched partitions we will separate out the old, unchanged rows. These are the rows in which there is no change in the clicks and impressions counts. For this we create a temp table site_view_temp2 as follows:
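A sketch of this step, assuming the composite key (browser_name, hit_date) described above; the LEFT OUTER JOIN plus NULL filter keeps only the rows that received no update:

-- old, unchanged rows = rows of the fetched partitions with no matching key in the raw table
CREATE TABLE site_view_temp2 AS
SELECT t1.browser_name,
       t1.clicks_count,
       t1.impressions_count,
       t1.hit_date
FROM site_view_temp1 t1
LEFT OUTER JOIN site_view_raw r
  ON  t1.browser_name = r.browser_name
  AND t1.hit_date     = r.hit_date
WHERE r.browser_name IS NULL;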








Data of site_view_temp2 table:

Step 2 – Now we insert into this new temp table all the rows from the raw table. This step brings in the updated rows as well as any new rows, and since site_view_temp2 already contains the old unchanged rows, it will now hold all the rows: new, updated, and unchanged. The following query does this:
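A sketch of the insert, assuming the column order used in the DDL sketches above:

-- append the updated rows and the brand new rows from the raw table
INSERT INTO TABLE site_view_temp2
SELECT browser_name,
       clicks_count,
       impressions_count,
       hit_date
FROM site_view_raw;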



New Data of site_view_temp2 table

Step 3 – Now simply insert-overwriting the site_view_hist table with the site_view_temp2 table gives us the required output rows, including the two updated rows for 2016-01-01 and the one newly inserted row for 2016-01-31.
CATCH – Since the history table is partitioned on hit_date, only the respective partitions will be overwritten, as follows:
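A sketch of the final step using Hive dynamic partitioning (the partition column must come last in the SELECT); the two SET statements are the usual switches needed for a purely dynamic partition insert:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- only the hit_date partitions present in site_view_temp2 are overwritten;
-- all other partitions of the history table stay untouched
INSERT OVERWRITE TABLE site_view_hist PARTITION (hit_date)
SELECT browser_name,
       clicks_count,
       impressions_count,
       hit_date
FROM site_view_temp2;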




Final history table with updated and inserted rows:

Benefits of this approach:         
  1. In the prep step itself we fetch only the partitions we have to update, so we never scan the whole history table. This makes the processing faster.
  2. In the final step, as we insert-overwrite the history table with the temp table, we touch only the partitions we want to update, along with the new partition created for the new record. This gives a big performance gain: my production process runs against a 6.7 TB history table with over 5 billion records, but because the 3-step process (contained in a single Hive script) touched only a few partitions holding a few thousand rows, it completed swiftly.

Comments

  1. Hi Akshay,

    What if we receive null in "hit_date" on the first day and then we receive the actual date later? How do we handle such a scenario?

    1. Hi Chirag,
      Ideally the hit_date column should never be NULL, since it is part of the composite key and is also the partition column as per the design of the table. If we still receive a null value for hit_date, the data should first be cleansed to delete that record, as it represents corrupt data; later, when it arrives with an actual date, it would simply be an insert record. With the upcoming Hive version 3.0 I expect this scenario to be taken care of, as I read that version 3.0 will allow us to apply NOT NULL constraints on columns like hit_date in the table definition itself.

  2. Hi Akshay,

    I have a bit different use case: let's say I have a table T1 with attributes A1, A2, A3, PARTITION_KEY.
    My PARTITION_KEY is dynamically generated for a given day, so the same row inserted on a previous day might be re-published today; this same data then has a different PARTITION_KEY but is the same data that I need to update. Also, overwriting a partition might wipe out other data sitting in the old partition.

    One solution I can think of is: fetch the PARTITION_KEYs for the data to be updated, then fetch all data for those PARTITION_KEYs and continue with the steps you mentioned. Do you think this is feasible, or can it be optimized in a better way?

    1. Hi Prits. The core point is that the row which needs to be updated should have the same values for the composite key columns as the new incoming updated row, and the partition key should be part of those composite key columns. If your partition key is also the date, then your data flow scenario is different, because the composite key itself changes as the same data arrives on a new date. If date is your partition key, then from the business side I am curious to know why exactly this is happening. Why is the same duplicate data coming on a different day? And if it is coming, do you not have any cleanup process to remove such duplicate rows? At the same time, if date is not your partition column, I would like to understand your table's partitioning strategy. The optimization and efficiency of the processing depend entirely on how far back in time and how frequently we receive updated records. A rule of thumb: the fewer the partitions we update (by overwriting), the faster the processing will be.

  3. Excellent article. Very interesting to read. I really love to read such a nice article. Thanks! keep rocking. Big data Hadoop online Training India

  4. I think Upsert functions such as Update and Delete are always the best when it comes to providing some information about database functions.

    SSIS Upsert

  5. Hi James Zicrov. Can you please elaborate on your comment? This article, which I published in 2016, focuses on Hive (Hadoop's data warehouse solution), which did start supporting UPDATE and DELETE, but with several restrictions for transaction support, ranging from file format requirements to mandatory bucketing of the data. I published a separate article on how to perform UPDATE and DELETE in Hive; you can find it in the Blog archive at the top right of this page. But UPDATE and DELETE operations in Hive come with several restrictions.
    This approach achieves UPSERT efficiently by utilizing the partitioned storage of the data in HDFS (or any other file system); it does so irrespective of the underlying file format and avoids the other restrictions as well.
    This article focuses on the Big Data warehouse solution Hive, not SSIS, hence I would appreciate it if you could elaborate your point.

  6. How did you load data into the site_view_raw table?

    1. site_view_raw is the table that contains the incoming data with the updated and the newly inserted records. Please don't be confused by the keyword 'raw' in its name. In my use case I had an unpartitioned table holding this source data of update and insert rows, from which I loaded the site_view_raw table using an insert overwrite command, making sure dynamic partitioning was enabled by setting the dynamic partitioning mode to nonstrict. I hope this helps, and sorry for the slight delay in my response.
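      A hedged sketch of such a load; incoming_source is a hypothetical name for the unpartitioned source table mentioned above:

      SET hive.exec.dynamic.partition = true;
      SET hive.exec.dynamic.partition.mode = nonstrict;

      INSERT OVERWRITE TABLE site_view_raw PARTITION (hit_date)
      SELECT browser_name,
             clicks_count,
             impressions_count,
             hit_date
      FROM incoming_source;  -- hypothetical unpartitioned source table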

