ML Study Jam — Forecasting model in BQML
What is ML Study Jam?
A free machine learning program using Qwiklabs.
What is BigQuery?
BigQuery is Google’s fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage, or needing a database administrator.
BigQuery Machine Learning (BQML, product in beta) is a new feature in BigQuery where data analysts can create, train, evaluate, and predict with machine learning models with minimal coding.
The flow of BigQuery ML Forecasting
The flow of ML forecasting is like the flowchart above.
- Select valuable fields from your raw data table for predictions.
- Choose a suitable
classfication modelto train your model.
- Evaluate the model performance.
- If the model performance is not good enough, try to filter the data or change the needed table column.
- If the model is good enough, use it to predict data.
- Get the predictions.
Predict Taxi Fare with a BigQuery ML Forecasting Model
In this lab, you will explore millions of New York City yellow taxi cab trips available in a BigQuery Public Dataset. You will then create a machine learning model inside of BigQuery to predict the fare of the cab ride given your model inputs. Lastly, you will evaluate the performance of your model and make predictions with it.
The lesson we’re going to learn is to predict the
total_fare by using the New York city taxi open data.
New York City Yellow Taxi Trips Record
Select features and create your training dataset
This is the first step of the flowchart, we select the below fields from our table as good inputs to our fare forecasting model.
Days of the week
Hour of Day
Pick up address
Drop off address
Number of passengers
Now, we’ve got all the columns we need and the
total_fare is what we want to predict later.
Select a BQML model type and specify options
Forecasting for predicting
total_fare because we want to get a
numeric value as a prediction result. The SQL query would be like this:
And the result would be:
Evaluate classification model performance
For linear regression models we want to use a loss metric like Root Mean Square Error (RMSE). Try to keep training and improving the model until it has the lowest RMSE.
mean_squared_erroris a queryable field when evaluating your trained ML model. Add a
SQRT()to get RMSE.
Then we can evaluate how well the model performs with this query using
Predict taxi fare amount
Now we can use the new model to make predictions.
It seems like the predictions are not good enough for us. Let’s improve it.
Improving the model with Feature Engineering
We can filter the training dataset by setting a possible range of
fare_amount. First, query the table to see if there are any incorrect values.
There are some strange outliers like negative fares or fares over $50000.
Let’s limit the data to only fares between $6 and $200.
Use the new dataset to train the model.
and get a better RMSE after then.
That’s all for this lab, next time we’ll learn about how to use BQML model type
logistic_reg with the article: