Below are a few tips on making HiveQL DRY.
Macros allow you to assign an alias to reusable processing logic that can be expressed in SQL. In simple terms, it's like defining a function purely in SQL (although it doesn't operate that way: Hive does an inline expansion, but we don't have to worry about that for now).
For instance, in the below table we have two duration fields where the duration value is expressed in different units (such as milliseconds, seconds, minutes, etc).
UUID | duration1 | duration2 |
---|---|---|
1 | 10ms | 20us |
2 | 16s | 20ms |
3 | 5m | 2us |
Below is a typical bad way of writing the HiveQL for this. It's bad because we have duplicated (once for each field) the logic that converts a duration expressed as a string into a duration in seconds. Any time we change that logic, we have to make sure to update it everywhere in the code.

    SELECT
        UUID,
        CASE
            WHEN duration1 like '%us' THEN CAST(REPLACE(duration1, 'us', '') AS DOUBLE) / 1.0E6
            WHEN duration1 like '%ms' THEN CAST(REPLACE(duration1, 'ms', '') AS DOUBLE) / 1000.0
            WHEN duration1 like '%s' THEN CAST(REPLACE(duration1, 's', '') AS DOUBLE)
            WHEN duration1 like '%m' THEN CAST(REPLACE(duration1, 'm', '') AS DOUBLE) * 60
            ELSE NULL
        END as duration1_seconds,
        CASE
            WHEN duration2 like '%us' THEN CAST(REPLACE(duration2, 'us', '') AS DOUBLE) / 1.0E6
            WHEN duration2 like '%ms' THEN CAST(REPLACE(duration2, 'ms', '') AS DOUBLE) / 1000.0
            WHEN duration2 like '%s' THEN CAST(REPLACE(duration2, 's', '') AS DOUBLE)
            WHEN duration2 like '%m' THEN CAST(REPLACE(duration2, 'm', '') AS DOUBLE) * 60
            ELSE NULL
        END as duration2_seconds
    FROM (
        SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
        UNION ALL
        SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
        UNION ALL
        SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
    ) A

The DRY way to rewrite the above query is to use a macro. We first define a macro DURATION_IN_SECONDS and then use it to convert all the duration fields, as shown below.

    -- define macro to convert duration string to duration in seconds
    CREATE TEMPORARY MACRO DURATION_IN_SECONDS (t string)
    CASE
        WHEN t like '%us' THEN CAST(REPLACE(t, 'us', '') AS DOUBLE) / 1.0E6
        WHEN t like '%ms' THEN CAST(REPLACE(t, 'ms', '') AS DOUBLE) / 1000.0
        WHEN t like '%s' THEN CAST(REPLACE(t, 's', '') AS DOUBLE)
        WHEN t like '%m' THEN CAST(REPLACE(t, 'm', '') AS DOUBLE) * 60
        ELSE NULL
    END;

    SELECT
        UUID,
        -- use macro to convert first duration field
        DURATION_IN_SECONDS(duration1) duration1_seconds,
        -- use macro to convert second duration field
        DURATION_IN_SECONDS(duration2) duration2_seconds
    FROM (
        SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
        UNION ALL
        SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
        UNION ALL
        SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
    ) A

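To make the conversion rules easy to check outside Hive, here is a minimal Python sketch of the same logic. The function name `duration_in_seconds` and the `rows` sample are illustrative only and simply mirror the macro and the table above; this is not how Hive evaluates the macro.

```python
def duration_in_seconds(t):
    """Convert strings such as '10ms', '20us', '16s' or '5m' to seconds."""
    # Longer suffixes are checked first so 'ms' and 'us' are not mistaken for 's',
    # which is the same ordering the CASE expression relies on.
    units = [("us", 1e-6), ("ms", 1e-3), ("s", 1.0), ("m", 60.0)]
    for suffix, factor in units:
        if t.endswith(suffix):
            return float(t[: -len(suffix)]) * factor
    return None  # unknown unit, the equivalent of ELSE NULL


# Same sample rows as the table above.
rows = [(1, "10ms", "20us"), (2, "16s", "20ms"), (3, "5m", "2us")]
converted = [
    (uid, duration_in_seconds(d1), duration_in_seconds(d2))
    for uid, d1, d2 in rows
]
```

The point of the sketch is the ordering of the suffix checks: if the `'%s'` branch came first, both `'10ms'` and `'20us'` would be handled by the wrong rule.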
Below is an example of another bad query. In the SQL below, we use tableC to filter tableA and tableB and then join the two together. The logic on how to filter tableC itself has been duplicated.

    SELECT *
    FROM (
        SELECT TableA.*
        FROM TableA
        JOIN TableC ON (TableA.id = TableC.id)
        WHERE TableA.datestr >= '2017-01-01'
            -- filters on table C
            AND TableC.datestr >= '2017-01-01'
            AND TableC.status != 0
    ) A
    JOIN (
        SELECT TableB.*
        FROM TableB
        JOIN TableC ON (TableB.id = TableC.id)
        WHERE TableB.datestr >= '2017-01-01'
            -- filters on table C
            AND TableC.datestr >= '2017-01-01'
            AND TableC.status != 0
    ) B
    ON (A.id = B.id)

Here, using a WITH clause can help us make this query DRY. We first express the logic for filtering tableC and assign it an alias. Next we join tableA and tableB to this alias.

    -- express logic to filter table C over here
    WITH FilteredTableC AS (
        SELECT *
        FROM TableC
        WHERE datestr >= '2017-01-01'
            AND status != 0
    )
    SELECT *
    FROM (
        SELECT TableA.*
        FROM TableA
        JOIN FilteredTableC ON (TableA.id = FilteredTableC.id)
        WHERE TableA.datestr >= '2017-01-01'
    ) A
    JOIN (
        SELECT TableB.*
        FROM TableB
        JOIN FilteredTableC ON (TableB.id = FilteredTableC.id)
        WHERE TableB.datestr >= '2017-01-01'
    ) B
    ON (A.id = B.id)

The WITH clause not only helps make SQL DRY, it is also very useful for breaking a big query involving many joins into smaller, self-describing chunks. For instance, below is an example of a query that joins three tables together. Even in this simple query it becomes difficult to understand the goal, because a long list of filters is applied to different tables.

    SELECT drivers.*, riders.*
    FROM trips
    JOIN drivers ON drivers.driver_id = trips.driver_id
    JOIN riders ON riders.rider_id = trips.rider_id
    WHERE trips.datestr >= '2017-01-01'
        AND trips.status = 0
        AND trips.city = 'SF'
        AND drivers.joined >= '2017-01-01'
        AND drivers.status = 'active'
        AND riders.joined >= '2017-01-01'
        AND riders.name like 'XYZ%'

Using the WITH clause allows us to rewrite the above query in a much more legible way. Each table is filtered separately and assigned a readable alias, which is then used in the main query.

    WITH SuccessfulTrips as (
        SELECT *
        FROM trips
        WHERE trips.status = 0
            AND trips.datestr >= '2017-01-01'
    ),
    ActiveDrivers as (
        SELECT *
        FROM drivers
        WHERE drivers.status = 'active'
            AND drivers.joined >= '2017-01-01'
    ),
    XYZRiders as (
        SELECT *
        FROM riders
        WHERE riders.name like 'XYZ%'
            AND riders.joined >= '2017-01-01'
    )
    SELECT ActiveDrivers.*, XYZRiders.*
    FROM SuccessfulTrips
    JOIN ActiveDrivers ON (SuccessfulTrips.driver_id = ActiveDrivers.driver_id)
    JOIN XYZRiders ON (SuccessfulTrips.rider_id = XYZRiders.rider_id)

Often we use same constant values in multiple places. Instead of copying these constant values all over the place we can easily define a variable and use the variable.

    SET start_date = '2017-01-01';
    SET end_date = '2017-05-01';

    SELECT A.*, B.*
    FROM A
    JOIN B ON (A.id = B.id)
    WHERE A.datestr >= ${hiveconf:start_date}
        AND A.datestr <= ${hiveconf:end_date}
        AND B.datestr >= ${hiveconf:start_date}
        AND B.datestr <= ${hiveconf:end_date}

There are a few different options for setting variables in Hive. Make sure to read the comments on this Stack Overflow post.
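Hive's variable substitution is textual, which is why the quotes are part of the value in the SET statements above. The following Python sketch only imitates that behaviour to show what the expanded query text looks like; `substitute_hiveconf` is a hypothetical helper written for illustration, not a Hive API.

```python
import re


def substitute_hiveconf(sql, variables):
    """Replace each ${hiveconf:name} placeholder with the raw text of the variable."""
    def lookup(match):
        return variables[match.group(1)]

    return re.sub(r"\$\{hiveconf:(\w+)\}", lookup, sql)


# The values keep their single quotes, exactly as they were written after SET.
variables = {"start_date": "'2017-01-01'", "end_date": "'2017-05-01'"}
template = (
    "WHERE A.datestr >= ${hiveconf:start_date} "
    "AND A.datestr <= ${hiveconf:end_date}"
)
expanded = substitute_hiveconf(template, variables)
```

If the quotes were left out of the SET statements, the expanded filter would compare `datestr` against the bare expression `2017-01-01`, which is arithmetic rather than a date string.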
Let's assume you have a model that can predict house prices. Naturally you won't trust it unless you evaluate it and establish some confidence in its expected error. So, to start with, you feed in features (such as number of rooms, lot size, etc.) for a certain house and compare the predicted price (say 130K) to its actual price (say 120K). In this particular case we can say that the model overestimated the price by 10K. But a single point is not sufficient to make a general claim about the accuracy or expected error of the model. So we feed in features for another 1000 houses and for each of them compute the error, i.e. the difference between the predicted and the actual price.
From descriptive statistics we know that there are different ways to summarize these 1000 error points. For instance we can summarize the general tendency of the dataset by mean or median or even draw a boxplot to understand the distribution of error.
Since we are interested in a numerical measure (rather than a visualization), using the mean to summarize all the observed errors makes sense. Thus we can compute the mean error.
However there is a problem. What if the error is -10K (i.e. an underestimate) for one house and 10K (i.e. an overestimate) for another? Then the mean error will be 0. Intuitively this doesn't make sense. It makes more sense to say that the expected error is 10K, i.e. we operate on the absolute error rather than on the signed (under/over estimate) error. Thus we have all the components of our first metric, namely Mean Absolute Error. To summarize, it's called mean absolute error because we take the absolute value of each error and then summarize those values with their mean.
Now, we know that the mean is sensitive to outliers. So sometimes we use the median instead of the mean, and the metric is known as median absolute error. The advantage of mean/median absolute error is that it's easy to make sense of the number. For instance, if the mean absolute error of a model is 20K and the predicted price is 200K, then the actual price is most likely somewhere between 180K and 220K.
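The cancellation problem and the effect of an outlier are easy to see on a tiny example. The prices below are made up for illustration (in thousands); only the Python standard library is used.

```python
from statistics import mean, median

# Hypothetical actual and predicted prices in K for four houses.
actual = [120, 200, 310, 150]
predicted = [130, 190, 320, 140]

signed_errors = [p - a for p, a in zip(predicted, actual)]  # [10, -10, 10, -10]
absolute_errors = [abs(e) for e in signed_errors]           # [10, 10, 10, 10]

mean_error = mean(signed_errors)             # over and under estimates cancel out
mean_absolute_error = mean(absolute_errors)  # the typical size of the error

# Add one badly mispredicted house (error of 100K) to see the outlier effect.
with_outlier = absolute_errors + [100]
mae_with_outlier = mean(with_outlier)      # pulled up by the single outlier
medae_with_outlier = median(with_outlier)  # unaffected by the single outlier
```

Here the mean error is 0 even though every prediction is off by 10K, while the mean absolute error is 10K. With the outlier added, the mean absolute error jumps to 28K but the median absolute error stays at 10K.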
Data scientists are not only concerned with quantifying the error but are also interested in determining if the model can be improved. To answer this question let’s first establish the best and the worst models.
Best Model
Theoretically, the best model is a model for which the absolute error is zero for all the test cases. As shown in the graph below, if we draw absolute error on x-axis and cumulative percentage of houses on y-axis then a point say (50K, 0.6) indicates that for 60% of houses the absolute error is less than or equal to 50K.
So, given this graph, what will the best model look like?
Since absolute error is always zero, the graph will be simply a vertical line starting from 0 on x-axis extending to 100% on y-axis.
Worst Model
Don't confuse the word "worst" with the word "dumb". Typically, for building a regression model we have a target variable (house price) and certain features or predictor variables such as number of rooms, lot size, etc. But what if there are no features available? For instance, suppose the only information provided is the house prices for 10K randomly selected houses. We can still build a model based on this limited information: we compute the mean house price over the 10K training samples, and our model simply returns this mean value. Let's say the mean value is 215K. If we ask this model for the price of a house with a lot size of 5000 sq ft, it will simply return 215K. Let's call this the mean model.
Theoretically it can be shown that when no other information is available, the mean model minimizes the error. Intuitively this makes sense, as we often tend to use the mean value when we have no other information. The graph below indicates what the curve for the mean model will look like.
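A quick numerical check of that claim, under the assumption that "error" is measured as the sum of squared errors (for absolute error it is the median, not the mean, that is the best constant). The prices are made up for illustration.

```python
# Hypothetical training prices in K; no features are available.
prices = [150, 180, 200, 230, 315]
mean_price = sum(prices) / len(prices)  # 215.0


def squared_error(constant):
    """Total squared error of a model that always predicts `constant`."""
    return sum((p - constant) ** 2 for p in prices)


# Compare the mean against a few other constant predictions around it.
candidates = [mean_price - 20, mean_price - 5, mean_price, mean_price + 5, mean_price + 20]
best_constant = min(candidates, key=squared_error)
```

Among the candidates, the constant with the smallest squared error is the mean itself; moving the prediction in either direction only increases the total.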
Determining scope for improvement
From the above graph, we can easily observe a few things. First, as our model becomes better, it moves towards the best model, and hence the area between the best model and our model decreases. On the other hand, the area between the worst model and our model increases. However the total area, i.e. the area between the best and the worst model, remains the same. Let's call this area the improvement opportunity. As our model gets better, it covers more of this improvement opportunity. This is exactly what the R2 metric captures. It indicates what portion of the total improvement opportunity our model covers, i.e. the ratio of the area between the worst model and our model to the area between the worst model and the best model.
Once we understand the above intuition, it's also easy to understand why there is often confusion about whether R2 ranges from 0 to 1 (as mentioned on Wikipedia) or can also be negative (as in the sklearn library). If we go by formula 1 in the above graph then R2 will always be between 0 and 1. However this doesn't tell us where our model is in comparison to the mean model. It implicitly assumes that our model will always be better than the mean model and hence will lie between the mean model and the best model.
But in practice it's possible that our model is worse than the mean model and falls on the right side of the mean model. In that case the area between the best model and our model will be bigger than the area between the best model and the mean model, and hence R2 will be negative.
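The same behaviour shows up with the usual sum-of-squares definition of R2, which is an assumption here since the post argues in terms of areas under the error curve. The sketch below uses made-up prices and a hand-written `r2_score` helper (not the sklearn function) to compare a good model, the mean model and a model that is worse than the mean model.

```python
def r2_score(actual, predicted):
    """1 - (squared error of the model) / (squared error of the mean model)."""
    mean_actual = sum(actual) / len(actual)
    ss_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_mean = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_model / ss_mean


actual = [100, 200, 300]        # hypothetical prices in K

good_model = [110, 190, 310]    # close to the actual prices
mean_model = [200, 200, 200]    # always predicts the mean
bad_model = [300, 200, 100]     # worse than predicting the mean

r2_good = r2_score(actual, good_model)
r2_mean = r2_score(actual, mean_model)
r2_bad = r2_score(actual, bad_model)
```

The good model scores close to 1, the mean model scores exactly 0, and the bad model scores -3, so under this definition a negative R2 is not even bounded below by -1.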
I hope now we can appreciate the beauty of R2 and understand the intuition behind it.
Luckily, using IPython Notebook you can have the best of both worlds. IPython Notebook (specifically the rpy2 package) lets you seamlessly transfer objects between the Python and R environments. Below is a brief explanation and code snippet showing how data generated or processed in Python can be visualized using R's ggplot. [In a hurry? Sample notebook over here.]
Step 1: Load R Kernel within IPython using rpy2 package
Previously, communication between R and the IPython notebook was handled by the rmagic extension. Now most of this logic has been moved into its own Python package, rpy2. You can install rpy2 using the following command: `pip install rpy2 --upgrade`. Once rpy2 is installed, you can initialize the R kernel within IPython Notebook using the `rpy2.ipython` extension as shown below.
%load_ext rpy2.ipython
Step 2: Convert Data To Pandas Dataframe
If you already have some data available as a pandas dataframe then feel free to use that data in the next step (and skip this step). If not, let's randomly select 5000 points from a normal distribution using numpy and convert them to a pandas dataframe. In the next step we will pass this dataframe to R's ggplot library and plot the density curve.

    import pandas as pd
    import numpy as np

    data = np.random.randn(5000, 1)
    df = pd.DataFrame(data, columns=["value"])

Step 3: Using %%R cell magic function
Finally, use the `%%R` cell magic function and pass `df` (the Python object pointing to the pandas dataframe) using the `-i` parameter. The rpy2 package will make it available within R's environment by applying the necessary transformations. Now we can do anything with this data, including visualizing it using R's ggplot library.

    %%R -i df -w 800 -h 480 -u px
    library(ggplot2)
    ggplot(df) + geom_density(aes(x=value))

Below is the list of some of the important parameters that can be passed to %%R magic function:
Full documentation on parameters can be found over here.
Reference:
1. Revolution Analytics’ Blog On Using R With Jupyter Notebook
2. Stack Overflow
1. %run -i: Running another notebook in the context of the current Python kernel
One of the fundamental tenets of good programming is to avoid duplication of code. That was one of the issues I always had with IPython Notebook. There are always a few classes/functions that you use across different notebooks. Initially I used to copy these functions into each notebook. However, with the %run magic function I finally found a solution to this problem. The %run magic function allows you to run another notebook in the context of the current Python kernel.
Assuming you defined all the common classes/functions in “common.ipynb” and you want to incorporate those in another notebook (say projectA.ipynb), then invoke the below command to make them available in projectA.ipynb.
%run -i common.ipynb
2. Progress Bars: Keep a check on your iterators.
Progress bars are a nice way to keep track of the processing time remaining. As shown below, IPython Notebook makes it pretty easy to include a nice-looking progress bar in your notebooks.

    from time import sleep

    from ipywidgets import FloatProgress
    from IPython.display import display

    f = FloatProgress(min=0, max=100)
    display(f)

    # Increment value of the progress bar within the iterator
    for i in range(100):
        sleep(0.1)
        f.value = i

(Yikes!!!.. so much code to get a progress bar). If you feel like me then you should install tqdm package. It makes adding a progress bar with minimal code a breeze.

    from time import sleep

    from tqdm import trange

    for i in trange(100):
        sleep(0.1)

3. Unit Testing: Make sure your functions/classes are working fine
Testing code is important, and it's easy to include unit tests in your IPython notebook. Below is an example of how to incorporate unittest.

    import unittest

    # Define Person class
    class Person(object):
        def __init__(self, name, age):
            self.__name = name
            self.__age = age

        @property
        def name(self):
            return self.__name

        @property
        def age(self):
            return self.__age

        def __str__(self):
            return "{} ({})".format(self.name, self.age)

        def __eq__(self, other):
            return self.name == other.name and self.age == other.age

    # Define unit test
    class PersonTest(unittest.TestCase):
        def test_initialization(self):
            p1 = Person("xyz", 10)
            self.assertEqual("xyz", p1.name)
            self.assertEqual(10, p1.age)

        def test_equality(self):
            p1 = Person("xyz", 10)
            p2 = Person("xyz", 10)
            self.assertEqual(p1, p2)

    # Run unit test
    suite = unittest.TestLoader().loadTestsFromTestCase(PersonTest)
    unittest.TextTestRunner().run(suite)

4. Use R’s ggplot to visualize data
Both Python and R have their own pros and cons. Luckily you can have the best of both worlds within IPython notebook. Using the rpy2 Python package you can seamlessly transfer data/objects between the Python and R environments. Check out more about this in another of my blog posts over here.
where
The covariance matrix for a dataset with independent features is a diagonal matrix. For a diagonal matrix we can easily show that
Using the above two properties of a diagonal matrix, we can show that equation 1 is essentially the same as equation 2 when the features are independent. Let's first tackle the determinant term in equation 1. Since the determinant of a diagonal matrix is equal to the product of its diagonal elements, we can rewrite
Now let’s focus on the exponential part in equation 1. Using 3, we can show that
Now can be written as . Thus
Replacing 1 with 5 and 7 we get
Hence proved.
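Since the equations did not survive here, the following is a sketch of the standard derivation, under the assumption that equation 1 is the multivariate Gaussian density and equation 2 is the product of univariate Gaussian densities. The symbols ($n$ features, means $\mu_i$, variances $\sigma_i^2$) are chosen for this sketch and the original equation numbers are not reproduced.

```latex
% Multivariate Gaussian density (assumed to be "equation 1")
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}}
    \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

% Independent features: \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2), so
|\Sigma| = \prod_{i=1}^{n} \sigma_i^2,
\qquad
\Sigma^{-1} = \mathrm{diag}\left( \frac{1}{\sigma_1^2}, \ldots, \frac{1}{\sigma_n^2} \right)

% Normalising constant
(2\pi)^{n/2} |\Sigma|^{1/2} = \prod_{i=1}^{n} \sqrt{2\pi}\, \sigma_i

% Exponent
(x - \mu)^T \Sigma^{-1} (x - \mu) = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}
\quad\Longrightarrow\quad
\exp\left( -\frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right)
    = \prod_{i=1}^{n} \exp\left( -\frac{(x_i - \mu_i)^2}{2 \sigma_i^2} \right)

% Putting both parts together (assumed to be "equation 2")
p(x; \mu, \Sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, \sigma_i}
    \exp\left( -\frac{(x_i - \mu_i)^2}{2 \sigma_i^2} \right)
```

The two diagonal-matrix properties do all the work: the determinant turns the normalising constant into a product, and the inverse turns the quadratic form into a sum whose exponential factorises.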
In order to write a custom UDAF you need to extend UserDefinedAggregateFunction and define the following four methods:
- initialize: on a given node, this method is called once for each group.
- update: for a given group, Spark calls update for each input record of that group.
- merge: if the function supports partial aggregates, Spark might (as an optimization) compute partial results and combine them together.
- evaluate: once all the entries for a group are exhausted, Spark calls evaluate to get the final result.

Depending on whether the function supports the combiner option or not, the order of execution can vary in the following two ways:
if the function supports partial aggregates
You can read more about the execution pattern in my earlier blog on custom UDAF in hive.
Apart from defining the above four methods you also need to define the input, intermediate and final data types. Below is an example showing how to write a custom function that computes the mean.

    package com.myudafs

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    /**
     * Created by ragrawal on 9/23/15.
     * Computes Mean
     */

    // Extend UserDefinedAggregateFunction to write custom aggregate function
    // You can also specify any constructor arguments. For instance you
    // can have CustomMean(arg1: Int, arg2: String)
    class CustomMean() extends UserDefinedAggregateFunction {

      // Input Data Type Schema
      def inputSchema: StructType = StructType(Array(StructField("item", DoubleType)))

      // Intermediate Schema
      def bufferSchema = StructType(Array(
        StructField("sum", DoubleType),
        StructField("cnt", LongType)
      ))

      // Returned Data Type
      def dataType: DataType = DoubleType

      // Same input always produces the same output
      def deterministic = true

      // This function is called whenever key changes
      def initialize(buffer: MutableAggregationBuffer) = {
        buffer(0) = 0.toDouble // set sum to zero
        buffer(1) = 0L         // set number of items to 0
      }

      // Iterate over each entry of a group
      def update(buffer: MutableAggregationBuffer, input: Row) = {
        buffer(0) = buffer.getDouble(0) + input.getDouble(0)
        buffer(1) = buffer.getLong(1) + 1
      }

      // Merge two partial aggregates
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
        buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
      }

      // Called after all the entries are exhausted.
      def evaluate(buffer: Row) = {
        buffer.getDouble(0) / buffer.getLong(1).toDouble
      }
    }

Below is the code that shows how to use UDAF with dataframe.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.functions._
    import com.myudafs.CustomMean

    // define UDAF
    val customMean = new CustomMean()

    // create test dataset
    val data = (1 to 1000).map{x:Int => x match {
      case t if t <= 500 => Row("A", t.toDouble)
      case t => Row("B", t.toDouble)
    }}

    // create schema of the test dataset
    val schema = StructType(Array(
      StructField("key", StringType),
      StructField("value", DoubleType)
    ))

    // construct data frame
    val rdd = sc.parallelize(data)
    val df = sqlContext.createDataFrame(rdd, schema)

    // Calculate average value for each group
    df.groupBy("key").agg(
      customMean(df.col("value")).as("custom_mean"),
      avg("value").as("avg")
    ).show()

Output should be
key | custom_mean | avg |
---|---|---|
A | 250.5 | 250.5 |
B | 750.5 | 750.5 |
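To make the initialize / update / merge / evaluate lifecycle concrete, here is a plain Python sketch that walks through it by hand for group A. It involves no Spark; the `MeanAggregator` class and the two-way split into partitions are assumptions made purely for illustration of how partial aggregates might be combined.

```python
class MeanAggregator:
    """Plain-Python stand-in for the four UDAF methods (no Spark involved)."""

    def initialize(self):
        return [0.0, 0]  # buffer: [sum, count]

    def update(self, buffer, value):
        buffer[0] += value
        buffer[1] += 1

    def merge(self, buffer1, buffer2):
        buffer1[0] += buffer2[0]
        buffer1[1] += buffer2[1]

    def evaluate(self, buffer):
        return buffer[0] / buffer[1]


agg = MeanAggregator()
group_a = [float(v) for v in range(1, 501)]        # same values as key "A" above
partitions = [group_a[:200], group_a[200:]]        # pretend the group lives on two nodes

# Each node initializes its own buffer and updates it record by record.
partial_buffers = []
for partition in partitions:
    buffer = agg.initialize()
    for value in partition:
        agg.update(buffer, value)
    partial_buffers.append(buffer)

# The partial aggregates are merged, then evaluate produces the final result.
merged = agg.initialize()
for buffer in partial_buffers:
    agg.merge(merged, buffer)
custom_mean = agg.evaluate(merged)
```

The result matches the 250.5 shown for key A in the table, regardless of how the records are split across the partial buffers.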
Few shortcomings of the UserDefinedAggregateFunction class
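As a motivating example, assume we are given some student data containing each student's name, subject and score, and we want to convert the numerical score into ordinal categories based on the following logic: A for a score of 80 or above, B for 60 to 79, C for 35 to 59, and D for anything below 35.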
Below is the relevant python code if you are using pyspark.

    # Generate Random Data
    import itertools
    import random
    students = ['John', 'Mike','Matt']
    subjects = ['Math', 'Sci', 'Geography', 'History']
    random.seed(1)
    data = []
    for (student, subject) in itertools.product(students, subjects):
        data.append((student, subject, random.randint(0, 100)))

    # Create Schema Object
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    schema = StructType([
        StructField("student", StringType(), nullable=False),
        StructField("subject", StringType(), nullable=False),
        StructField("score", IntegerType(), nullable=False)
    ])

    # Create DataFrame
    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)
    rdd = sc.parallelize(data)
    df = sqlContext.createDataFrame(rdd, schema)

    # Define udf
    from pyspark.sql.functions import udf
    def scoreToCategory(score):
        if score >= 80: return 'A'
        elif score >= 60: return 'B'
        elif score >= 35: return 'C'
        else: return 'D'

    udfScoreToCategory=udf(scoreToCategory, StringType())
    df.withColumn("category", udfScoreToCategory("score")).show(10)

Lines 2-10 are basic Python. We are generating a random dataset that looks something like this:
student | subject | score |
---|---|---|
John | Math | 13 |
… | … | … |
Mike | Sci | 45 |
Mike | Geography | 65 |
… | … | … |
Next, lines 12-24 deal with constructing the dataframe. The main part of the code is in lines 27-34. We first define our function in the normal Python way, then wrap it with udf (declaring the return type) and apply it to the score column.
Below is a Scala example of the same:

    // Construct Dummy Data
    import util.Random
    import org.apache.spark.sql.Row

    implicit class Crossable[X](xs: Traversable[X]) {
      def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
    }

    val students = Seq("John", "Mike","Matt")
    val subjects = Seq("Math", "Sci", "Geography", "History")
    val random = new Random(1)
    val data = (students cross subjects).map{x => Row(x._1, x._2, random.nextInt(100))}.toSeq

    // Create Schema Object
    import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
    val schema = StructType(Array(
      StructField("student", StringType, nullable=false),
      StructField("subject", StringType, nullable=false),
      StructField("score", IntegerType, nullable=false)
    ))

    // Create DataFrame
    import org.apache.spark.sql.hive.HiveContext
    val rdd = sc.parallelize(data)
    val df = sqlContext.createDataFrame(rdd, schema)

    // Define udf
    import org.apache.spark.sql.functions.udf
    def udfScoreToCategory = udf((score: Int) => {
      score match {
        case t if t >= 80 => "A"
        case t if t >= 60 => "B"
        case t if t >= 35 => "C"
        case _ => "D"
      }})

    df.withColumn("category", udfScoreToCategory(df("score"))).show(10)

Compared to the earlier Hive version, this is much more efficient because it uses combiners (so that we can do map-side computation) and stores only N records at any given time on both the mapper and the reducer side.

    import heapq

    def takeOrderedByKey(self, num, sortValue = None, reverse=False):

        def init(a):
            return [a]

        def combine(agg, a):
            agg.append(a)
            return getTopN(agg)

        def merge(a, b):
            agg = a + b
            return getTopN(agg)

        def getTopN(agg):
            if reverse == True:
                return heapq.nlargest(num, agg, sortValue)
            else:
                return heapq.nsmallest(num, agg, sortValue)

        return self.combineByKey(init, combine, merge)


    # Create some fake student dataset. The objective is to identify the top 2
    # students in each class based on GPA scores.
    data = [
        ('ClassA','Student1', 3.89),('ClassA','Student2', 3.13),('ClassA', 'Student3',3.87),
        ('ClassB','Student1', 2.89),('ClassB','Student2', 3.13),('ClassB', 'Student3',3.97)
    ]

    # Add takeOrderedByKey function to RDD class
    from pyspark.rdd import RDD
    RDD.takeOrderedByKey = takeOrderedByKey

    # Load dataset
    rdd1 = sc.parallelize(data).map(lambda x: (x[0], x))

    # extract top 2 records in each class ordered by GPA in descending order
    for i in rdd1.takeOrderedByKey(2, sortValue=lambda x: x[2], reverse=True).flatMap(lambda x: x[1]).collect():
        print(i)

Output of the above program is:

    ('ClassB', 'Student3', 3.97)
    ('ClassB', 'Student2', 3.13)
    ('ClassA', 'Student1', 3.89)
    ('ClassA', 'Student3', 3.87)

The key line to understand is line number 22. We use the `combineByKey` operator to split the dataset by key and then use the heap data structure to order input records by GPA score. You can find a good explanation of the `combineByKey` operator on Adam Shinn's blog.
Finally, note that in line number 40, `x` in `sortValue = lambda x: x[2]` refers to the value of the PairRDD created at line number 37.
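The bounded-buffer idea can be tried without a Spark cluster. The sketch below imitates only the `init` and `combine` steps on a single partition (there is no `merge`, since nothing is distributed); `top_n_by_key` is a hypothetical helper written for this illustration and is not part of PySpark.

```python
import heapq


def top_n_by_key(pairs, num, sort_value=None, reverse=False):
    """Keep at most `num` values per key, trimming the buffer after every record."""
    pick = heapq.nlargest if reverse else heapq.nsmallest
    buffers = {}
    for key, value in pairs:
        if key not in buffers:
            buffers[key] = [value]  # same role as init
        else:
            # same role as combine: append, then keep only the top `num`
            buffers[key] = pick(num, buffers[key] + [value], key=sort_value)
    return buffers


data = [
    ("ClassA", "Student1", 3.89), ("ClassA", "Student2", 3.13), ("ClassA", "Student3", 3.87),
    ("ClassB", "Student1", 2.89), ("ClassB", "Student2", 3.13), ("ClassB", "Student3", 3.97),
]

# Key each record by class, then keep the top 2 by GPA in descending order.
top2 = top_n_by_key(((row[0], row) for row in data), 2,
                    sort_value=lambda row: row[2], reverse=True)
```

Because the buffer is trimmed after every record, it never holds more than `num + 1` entries per key, which is the property that keeps memory use low on both the map and the reduce side.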

    print(3 > 2)            # True
    print([3] > [2])        # True
    print([2,1] > [2])      # True
    print((2,1) > (2,))     # True
    print((2,2) > (2,2))    # False

Below is an example of how to use the above information to sort RDD based on multiple fields and extract top N records. Basically we return a tuple as the key.

    # load dataset
    data = sc.parallelize(...)

    # Order by Col 1 in Desc Order and then Col 0 in ascending order
    topN = data.takeOrdered(10, key=lambda x: (-1 * x[1], x[0]))

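The tuple-key trick can be checked locally with `heapq.nsmallest`, which relies on the same tuple comparison rules. The `rows` list is made up for illustration, and no Spark is involved.

```python
import heapq

# (name, score) records; two of them tie on the score.
rows = [("b", 3), ("a", 3), ("c", 1), ("d", 2)]

# Order by column 1 descending, then by column 0 ascending.
# Negating the score flips its order; this only works for numeric columns.
top3 = heapq.nsmallest(3, rows, key=lambda x: (-1 * x[1], x[0]))
```

Here `top3` is `[("a", 3), ("b", 3), ("d", 2)]`: the two records with score 3 come first, and the tie between them is broken alphabetically by the first column.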
Code References:
1. takeOrdered: Note that it uses MaxHeapQ to collect elements and order them.
2. MaxHeapQ: Uses the basic Python comparison operators to organize the heap.
For simplicity, as indicated by the blue line in Fig A., let's assume the target distribution from which you want random samples is a truncated normal distribution with a domain of -3 to 3, i.e. -3 <= x <= 3. Also, as indicated by the dashed black line, assume a rectangular enveloping region around the target distribution that is bounded by (-3, 0) and (3, 0.4167).
If we randomly select x and y from this enveloping region and plot these points, as shown by the red and green dots, some of them will fall inside our target distribution and some outside it. For any random point (x, y), if it falls within the target distribution, i.e. y < P_{target}(X=x), then we accept it as a valid sample point. For instance, assume the random point is (-1, 0.15). At X=-1, the probability density for the target distribution is P_{target}(X=-1) ≈ 0.243. Since 0.15 < 0.243, we accept (-1, 0.15) as a valid point.
Let's convert the above idea into working Python code. After that we will look at how to create this enveloping region around our target distribution.

    import random
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.mlab as mlab
    from scipy.stats import truncnorm

    #Domain of X
    xdomain = [-3, 3]

    def pdf(x):
        """
        Probability distribution function for Random Variable X
        from which we want to sample points. Here we assume
        we have truncated standard normal distribution in the
        domain of -3 to 3
        """
        return truncnorm.pdf(x, xdomain[0], xdomain[1])

    def random_point_within_enveloping_region():
        """
        Return a random point within the rectangular enveloping
        region bounded by (-3, 0) and (3, 0.4167)
        """
        x = random.uniform(xdomain[0], xdomain[1])
        y = random.uniform(0, 0.4167)
        return (x,y)

    #Number of sample points to sample
    n = 100

    #Creating two arrays to capture accepted and rejected points
    accepted = []
    rejected = []

    #Run this loop until we got required number of valid points
    while len(accepted) < n:
        #Get random point
        x, y = random_point_within_enveloping_region()

        #If y is below blue curve then accept it
        if y < pdf(x):
            accepted.append((x, y))
        #otherwise reject it.
        else:
            rejected.append((x, y))

    #Plot the graph
    x = np.linspace(xdomain[0], xdomain[1], 100)
    plt.plot(x, [pdf(i) for i in x], color='blue')  # Plot Random Variable X
    plt.plot(x, [0.4167 for i in x], color='black', ls='dashed', lw=2)  # Plot Enveloping Region
    plt.plot([x[0] for x in accepted], [x[1] for x in accepted], 'go')  # Plot Accepted Points
    plt.plot([x[0] for x in rejected], [x[1] for x in rejected], 'ro')  # Plot Rejected Points
    plt.show()

    #Calculate expected value for the truncated standard normal distribution
    approxMean = sum([x[0] for x in accepted])/len(accepted)
    print("Expected Mean = ", 0, pdf(0))
    print("Approximated Mean = ", approxMean, pdf(approxMean))
    print("Approximated Variance = ", sum([(x[0] - approxMean)**2 for x in accepted])/(len(accepted)-1))

Output:

    Expected Mean = 0 0.400022258921
    Approximated Mean = -0.0896272908375 0.398418781625
    Approximated Variance = 1.17167227825

Above, the term "enveloping region" was used broadly to indicate some distribution function. Any distribution (such as gaussian, uniform, etc.) can be used as an enveloping distribution as long as the following condition (condition A) is met:
P_{envelope}(X=x) >= P_{target}(X=x) for all x in the domain of the target distribution.
That is, at every possible value of x in the domain of the target distribution, the density at X=x under the envelope distribution is at least as large as that under the target distribution. For instance, for the truncated normal distribution at X=0, P_{target}(X=0) ≈ 0.4. Thus any enveloping distribution has to be at least 0.4 at the point X = 0.
In the above example I assumed a uniform distribution ranging from -3 to 3 as the envelope distribution. For this envelope distribution, P_{envelope}(X=x) = 1/6 ≈ 0.167. However, if we simply use this uniform distribution as it is, we will violate condition A above. For instance, since the envelope distribution is uniform, at X=0 we have P_{envelope}(X=0) = 0.167 < P_{target}(X=0) = 0.4.
To overcome this challenge, rejection sampling introduces a multiplier constant M. In the above example I used M = 2.5. The reason is that the maximum height of the truncated standard normal distribution is about 0.4 at X = 0, while for the given envelope distribution P(X=0) = 0.167. Thus M has to be at least 0.4/0.167 ≈ 2.4, which I rounded up to 2.5. Using this multiplier, we can reformulate the condition for the enveloping distribution as follows (condition B):
M * P_{envelope}(X=x) >= P_{target}(X=x) for all x in the domain of the target distribution.
In practice, finding the maximum height of the target distribution can be challenging. Also, we only need to make sure that condition B holds at the points we actually sample. Hence in practice we start with an arbitrary M. After each sample we check that the above condition holds (see line numbers 57-62 in the modified code below). If it does not, we increment M and restart the sampling from the beginning.
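One consequence of the multiplier worth checking numerically: with a uniform envelope, the fraction of accepted points should be close to 1/M, because the target density has area 1 while the scaled envelope has area M. The sketch below is a standard-library-only approximation of the setup above; `target_pdf`, `acceptance_rate`, the seed and the number of trials are choices made for this illustration rather than part of the original code.

```python
import math
import random

LOW, HIGH = -3.0, 3.0
# Probability mass of a standard normal inside [-3, 3], used to renormalise.
Z = math.erf(HIGH / math.sqrt(2))


def target_pdf(x):
    """Density of the standard normal truncated to [-3, 3]."""
    return math.exp(-x * x / 2) / (math.sqrt(2 * math.pi) * Z)


def acceptance_rate(M, trials, rng):
    """Fraction of uniformly drawn points that fall under the target density."""
    height = M / (HIGH - LOW)  # M * P_envelope, with P_envelope = 1/6
    accepted = 0
    for _ in range(trials):
        x = rng.uniform(LOW, HIGH)
        y = rng.uniform(0, height)
        if y < target_pdf(x):
            accepted += 1
    return accepted / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
rate = acceptance_rate(2.5, 100000, rng)
```

With M = 2.5 the rate comes out near 0.4, so roughly 60% of the draws are thrown away. A larger M keeps condition B safe but wastes proportionally more samples.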

    import random
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.mlab as mlab
    from scipy.stats import truncnorm

    #Domain of X
    xdomain = [-3, 3]

    #Multiplier Constant
    M = 2.0

    def pdf(x):
        """
        Probability distribution function for Random Variable X
        from which we want to sample points. Here we assume
        we have truncated standard normal distribution in the
        domain of -3 to 3
        """
        return truncnorm.pdf(x, xdomain[0], xdomain[1])

    def random_point_within_enveloping_region():
        """
        Return random point within the enveloping region.
        For x we will randomly sample point between -3 and 3
        Since we are assuming uniform distribution, the height
        of the enveloping region at any x is M * 1/6. So for Y
        we randomly sample point between 0 and M * 1/6
        """
        #Randomly sample x from -3 to 3
        x = random.uniform(xdomain[0], xdomain[1])
        # probability of obtaining any x is equal to 1/6, i.e. the height of the
        # unscaled envelope for any X is 1/6.
        y = random.uniform(0, M * 1.0/6.0 )
        return (x,y)

    def height_of_enveloping_region(x):
        """Return height of enveloping region at x."""
        return M * 1.0/6.0

    #Number of sample points to sample
    n = 100

    #Creating two arrays to capture accepted and rejected points
    accepted = []
    rejected = []

    M = 2.0

    #Run this loop until we got required number of valid points
    while len(accepted) < n:
        #Get random point
        x, y = random_point_within_enveloping_region()

        #If at any x the enveloping region is below the distribution from which we want to sample points,
        #increment the multiplier constant and resample all the points.
        if height_of_enveloping_region(x) < pdf(x):
            print("Increasing M from {0} to {1}".format(M, M+1))
            accepted = []
            rejected = []
            M += 1.0
            continue

        #If y is below blue curve then accept it
        if y < pdf(x):
            accepted.append((x, y))
        #otherwise reject it.
        else:
            rejected.append((x, y))

    x = np.linspace(xdomain[0], xdomain[1], 100)
    plt.plot(x, [pdf(i) for i in x], color='blue')
    plt.plot(x, [1.0/6 for i in x], color='black', ls='dashed', lw=1)
    plt.plot(x, [M * 1.0/6 for i in x], color='black', ls='dashed', lw=2)
    plt.plot([x[0] for x in accepted], [x[1] for x in accepted], 'go')
    plt.plot([x[0] for x in rejected], [x[1] for x in rejected], 'ro')
    plt.show()

    #Calculate expected value for the truncated standard normal distribution
    approxMean = sum([x[0] for x in accepted])/len(accepted)
    print("Expected Mean = ", 0, pdf(0))
    print("Approximated Mean = ", approxMean, pdf(approxMean))
    print("Approximated Variance = ", sum([(x[0] - approxMean)**2 for x in accepted])/(len(accepted)-1))

See example over here
Since we are throwing away all the points that fall outside the target distribution, one might ask why we don't directly sample y in the "random_point_within_enveloping_region" function between 0 and P_{target}(X=x) (instead of P_{envelope}(X=x)). The problem with this approach is that x is not uniformly distributed in our target distribution. If we sample points from a truncated normal distribution, we are likely to get many more values near x = 0 than near x = -3. However, in "random_point_within_enveloping_region" we are using a uniform distribution to sample x.
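The difference is easy to see by comparing the spread of the x values produced by the two approaches. This is a standard-library-only sketch: the helper names, the seed and the sample size are choices made for this illustration, and the envelope height of 0.4167 is taken from the first example.

```python
import math
import random

LOW, HIGH = -3.0, 3.0
Z = math.erf(HIGH / math.sqrt(2))  # mass of a standard normal inside [-3, 3]


def target_pdf(x):
    """Density of the standard normal truncated to [-3, 3]."""
    return math.exp(-x * x / 2) / (math.sqrt(2 * math.pi) * Z)


def sample_with_rejection(n, rng, height=0.4167):
    """Draw (x, y) under the envelope and keep x only when y is below the target."""
    xs = []
    while len(xs) < n:
        x = rng.uniform(LOW, HIGH)
        y = rng.uniform(0, height)
        if y < target_pdf(x):
            xs.append(x)
    return xs


def sample_without_rejection(n, rng):
    """Draw y directly below the target curve, so every x is accepted."""
    xs = []
    for _ in range(n):
        x = rng.uniform(LOW, HIGH)
        y = rng.uniform(0, target_pdf(x))  # always falls under the curve
        xs.append(x)
    return xs


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(1)
var_rejection = variance(sample_with_rejection(20000, rng))
var_shortcut = variance(sample_without_rejection(20000, rng))
```

The rejection sampler gives a variance a little below 1, as expected for a standard normal truncated at ±3, whereas the shortcut gives a variance close to 3, which is the variance of a uniform distribution on [-3, 3]. In other words, the shortcut never changes the distribution of x at all.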