ML Study Jam — Classification Model in BQML
Predict Visitor Purchases with a Classification Model in BQML
Prerequisite
Before we get started, I recommend you read the previous article first.
ML Study Jam — Predict Taxi Fare with a BigQuery ML Forecasting Model
What is ML Study Jam?
medium.com
Its explanation of the BigQuery ML forecasting flow will help you understand what we can do in the BigQuery lab.
Dataset — Google Analytics logs
Clicking the link below automatically imports the dataset into your BigQuery console. It contains ecommerce Google Analytics data from the Google Merchandise Store website, covering 2016/08/01 to 2017/08/01.
Today we’re going to use this data to predict whether a new user is likely to make a purchase in the future. Here are some useful links:
1. The definition of the fields.
2. The preview of the demo dataset.
Explore ecommerce data
Question: Out of the total visitors who visited our website, what % made a purchase?

Answer:
1. total_visitors => the number of unique fullVisitorId values
2. total_purchasers => the number of those visitors for whom totals.transactions is not null

So the result is (total_purchasers / total_visitors) x 100% = 2.69%.
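The calculation above can be sketched in plain Python on a handful of made-up session rows. The rows and the `conversion_rate` helper are hypothetical and only illustrate the arithmetic; the real answer comes from querying the dataset in BigQuery.

```python
def conversion_rate(sessions):
    """Return the percentage of unique visitors with at least one purchase session."""
    visitors = {s["fullVisitorId"] for s in sessions}
    purchasers = {
        s["fullVisitorId"] for s in sessions if s["transactions"] is not None
    }
    return 100.0 * len(purchasers) / len(visitors)


# Hypothetical toy sessions (not real rows from the dataset).
toy_sessions = [
    {"fullVisitorId": "A", "transactions": None},
    {"fullVisitorId": "A", "transactions": 1},
    {"fullVisitorId": "B", "transactions": None},
    {"fullVisitorId": "C", "transactions": None},
    {"fullVisitorId": "D", "transactions": None},
]

print(f"{conversion_rate(toy_sessions):.2f}%")
```

Visitor A appears twice but is counted once, which is why the sketch collects visitor IDs in sets.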
Question: What are the top 5 selling products?

Answer:
We sort the products by revenue in decreasing order and keep the first five rows. The top 5 best-selling products all come from Google Nest. 😆



Question: How many visitors bought on subsequent visits to the website?

Answer:
We can find the answer by selecting all visitors with totals.transactions > 0 (meaning they bought something) and totals.newVisits is null (meaning it was not their first visit to the website). The answer is 11873 visitors.
We can see that (11873 / 729848) = 1.6% of total visitors returned and purchased from the website.
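As a quick check of the filter and the percentage above, here is a small Python sketch. The `is_returning_purchase` helper and its dictionary keys are hypothetical stand-ins for the totals.transactions and totals.newVisits fields; the two counts are the ones quoted in the text.

```python
def is_returning_purchase(session):
    """True when the session has a purchase and is not the visitor's first visit."""
    transactions = session["transactions"]
    return (
        transactions is not None
        and transactions > 0
        and session["newVisits"] is None
    )


returning_purchasers = 11873  # quoted in the text above
total_visitors = 729848       # quoted in the text above
share = 100 * returning_purchasers / total_visitors

print(f"{share:.2f}%")
```

The share comes out slightly above 1.6%, which matches the rounded figure in the text.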
Select features and create your training dataset
Your team decides to test whether these two fields are good inputs for your classification model:
1. totals.bounces (whether the visitor left the website immediately)
2. totals.timeOnSite (how long the visitor was on our website)

(Maybe these two fields are mutually exclusive? 🤔)

The result of the query is shown below. At first glance, of the top 10 visitors by time_on_site, only 1 returned to buy. The two fields we picked do not seem very relevant to the label will_buy_on_return_visit, but it is too early to tell before training and evaluating the model.
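To make the shape of the training data concrete, here is a rough Python sketch of how one row per visitor could be assembled. It assumes the features are taken from the first visit and the label is set by a purchase on any later visit; that matches the definitions used earlier in this article, but the helper, the toy sessions, and the choice to treat missing values as 0 are my own assumptions, not the lab's query.

```python
def build_training_rows(sessions):
    """One row per visitor: first-visit features plus a return-purchase label."""
    rows = {}
    for s in sessions:
        row = rows.setdefault(
            s["fullVisitorId"],
            {"bounces": 0, "time_on_site": 0, "will_buy_on_return_visit": 0},
        )
        if s["newVisits"] == 1:
            # Features come from the first visit; missing values become 0.
            row["bounces"] = s["bounces"] or 0
            row["time_on_site"] = s["timeOnSite"] or 0
        elif (s["transactions"] or 0) > 0:
            # A purchase on any later visit sets the label.
            row["will_buy_on_return_visit"] = 1
    return rows


# Hypothetical toy sessions (not real rows from the dataset).
toy_sessions = [
    {"fullVisitorId": "A", "newVisits": 1, "bounces": None,
     "timeOnSite": 300, "transactions": None},
    {"fullVisitorId": "A", "newVisits": None, "bounces": None,
     "timeOnSite": 120, "transactions": 1},
    {"fullVisitorId": "B", "newVisits": 1, "bounces": 1,
     "timeOnSite": None, "transactions": None},
]

training_rows = build_training_rows(toy_sessions)
for visitor, row in training_rows.items():
    print(visitor, row)
```

Visitor B bounced on the first visit and has no time on site, which hints at why the two fields might overlap.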

Select a BQML model type and specify options

BQML supports three types of model. We pick binary classification for this case, because the prediction label will_buy_on_return_visit is a true/false value.

And the result would be

Evaluate classification model performance
For classification problems in ML, you want to minimize the False Positive Rate (predicting that the user will return and purchase when they don’t) and maximize the True Positive Rate (predicting that the user will return and purchase when they do). This relationship is visualized with a ROC (Receiver Operating Characteristic) curve like the one shown here, where you try to maximize the area under the curve (AUC):

In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.
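To make the ROC terms above concrete, here is a minimal pure-Python sketch on four made-up predictions. It is not how BQML computes roc_auc internally; it uses the pairwise definition of AUC (the chance that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half). The helper names and toy data are hypothetical.

```python
def rates(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at a score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = returned and purchased, 0 = did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(rates(toy_labels, toy_scores, threshold=0.5))
print(roc_auc(toy_labels, toy_scores))
```

A value of 0.5 would mean the scores rank buyers no better than chance, and 1.0 would mean every buyer is ranked above every non-buyer.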
Now that training is complete, run this query to evaluate how well the model performs using ML.EVALUATE:


Improve model performance with Feature Engineering
The roc_auc is about 0.725, so the quality of the model is only decent. Now let’s improve the model performance by adding more fields.
- How far the visitor got in the checkout process on their first visit (hits.eCommerceAction.action_type):
  1 = Click through of product lists
  2 = Product detail views
  3 = Add product(s)
  4 = Remove product(s)
  5 = Check out
  6 = Completed purchase
  7 = Refund of purchase
  8 = Checkout options
  0 = Unknown
- Where the visitor came from (trafficSource: organic search, referring site, etc.)
- Device category (mobile, tablet, desktop)
- Geographic information (country)

Train the new model:


The new fields substantially improve the model quality: the roc_auc is now up to 0.91.
Predict which new visitors will come back and purchase
Predict the probability that a first-time visitor to the Google Merchandise Store will make a purchase in a later visit:


Your model now outputs its predictions for the July 2017 (2017/07/01–2017/08/01) ecommerce sessions. You can see three newly added fields:
- predicted_will_buy_on_return_visit: whether the model thinks the visitor will buy later (1 = yes)
- predicted_will_buy_on_return_visit_probs.label: the binary classifier for yes / no
- predicted_will_buy_on_return_visit_probs.prob: the confidence the model has in its prediction (1 = 100%)
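Here is a small Python sketch of how these fields relate to each other. It assumes the probs field is a list of label/prob pairs, as the field names suggest, and that the predicted label is simply the more probable one; the visitors and numbers are made up.

```python
def predicted_label(probs):
    """Pick the label whose probability is highest."""
    return max(probs, key=lambda entry: entry["prob"])["label"]


def purchase_probability(probs):
    """Probability attached to label 1 (will buy on a return visit)."""
    return next(entry["prob"] for entry in probs if entry["label"] == 1)


# Hypothetical prediction rows keyed by visitor.
toy_predictions = {
    "visitor_a": [{"label": 1, "prob": 0.08}, {"label": 0, "prob": 0.92}],
    "visitor_b": [{"label": 1, "prob": 0.61}, {"label": 0, "prob": 0.39}],
}

# Rank visitors from most to least likely to buy later.
ranked = sorted(
    toy_predictions,
    key=lambda visitor: purchase_probability(toy_predictions[visitor]),
    reverse=True,
)

for visitor in ranked:
    probs = toy_predictions[visitor]
    print(visitor, predicted_label(probs), purchase_probability(probs))
```

Sorting by the probability for label 1 is what makes it possible to target only the most promising visitors.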
Results
Of the top 6% of first-time visitors (sorted in decreasing order of predicted probability), more than 6% make a purchase in a later visit.
These users represent nearly 50% of all first-time visitors who make a purchase in a later visit.
Overall, only 0.7% of first-time visitors make a purchase in a later visit.
Targeting the top 6% of first-time visitors increases marketing ROI by about 9x compared with targeting them all!
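The 9x figure follows from the two rates above. Here is the arithmetic as a tiny Python sketch; the `lift` helper is hypothetical, and 6% is used as a lower bound because the text says "more than 6%".

```python
def lift(segment_rate, overall_rate):
    """How many times more likely the segment is to convert than the average visitor."""
    return segment_rate / overall_rate


top_segment_rate = 0.06  # "more than 6%" of the top 6% purchase in a later visit
overall_rate = 0.007     # 0.7% of all first-time visitors purchase in a later visit

print(f"lift is at least {lift(top_segment_rate, overall_rate):.1f}x")
```

With exactly 6% the ratio is a little under 9, so the quoted 9x is a rounded figure.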
Additional information
Tip: add warm_start = true to your model options if you are retraining an existing model on new data for faster training times. Note that you cannot change the feature columns (this would necessitate a new model).
roc_auc is just one of the performance metrics available during model evaluation. Also available are accuracy, precision, and recall. Which performance metric to rely on depends heavily on your overall objective.
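To show how accuracy, precision, and recall differ, here is a minimal Python sketch that computes them from made-up labels and predictions. The helper and data are hypothetical; in BQML these values come back from the evaluation query.

```python
def classification_metrics(labels, predictions):
    """Return accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the visitors we flagged as buyers, how many really bought?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the visitors who really bought, how many did we flag?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical toy data: 1 = bought on a return visit.
toy_labels = [1, 0, 0, 0, 1, 0, 0, 0]
toy_predictions = [1, 0, 0, 1, 0, 0, 0, 0]

print(classification_metrics(toy_labels, toy_predictions))
```

When buyers are rare, as in this dataset, accuracy can look high even if the model misses most of them, so precision and recall are often more telling.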
References
KKNews — ROC曲線 https://kknews.cc/zh-tw/military/ymk26xb.html
Final
Thanks to ML Study Jam, the source of most of the pictures in this article.
Hopefully, this article helps you understand how to answer a true/false prediction question using the BigQuery lab.
You can also read the other BigQuery article.
ML Study Jam — Predict Taxi Fare with a BigQuery ML Forecasting Model
In the next post, I’ll introduce Detect Labels, Faces, and Landmarks in Images with the Cloud Vision API.