ML Study Jam — Forecasting model in BQML
Predict Taxi Fare with a BigQuery ML Forecasting Model
What is ML Study Jam?
A free machine learning program using Qwiklabs.
What is BigQuery?
BigQuery is Google’s fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage, or needing a database administrator.
BigQuery Machine Learning (BQML, product in beta) is a new feature in BigQuery where data analysts can create, train, evaluate, and predict with machine learning models with minimal coding.
The flow of BigQuery ML Forecasting

The flow of ML forecasting is like the flowchart above.
- Select valuable fields from your raw data table for predictions.
- Choose a suitable
classfication model
to train your model. - Evaluate the model performance.
- If the model performance is not good enough, try to filter the data or change the needed table column.
- If the model is good enough, use it to predict data.
- Get the predictions.
Predict Taxi Fare with a BigQuery ML Forecasting Model
In this lab, you will explore millions of New York City yellow taxi cab trips available in a BigQuery Public Dataset. You will then create a machine learning model inside of BigQuery to predict the fare of the cab ride given your model inputs. Lastly, you will evaluate the performance of your model and make predictions with it.
The lesson we’re going to learn is to predict the total_fare
by using the New York city taxi open data.
New York City Yellow Taxi Trips Record
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Select features and create your training dataset
This is the first step of the flowchart, we select the below fields from our table as good inputs to our fare forecasting model.
Tolls Amount
Fare Amount
Days of the week
Hour of Day
Pick up address
Drop off address
Number of passengers

Now, we’ve got all the columns we need and the total_fare
is what we want to predict later.
Select a BQML model type and specify options

We pick Forecasting
for predicting total_fare
because we want to get a numeric value
as a prediction result. The SQL query would be like this:

And the result would be:

Evaluate classification model performance
For linear regression models we want to use a loss metric like Root Mean Square Error (RMSE). Try to keep training and improving the model until it has the lowest RMSE.
In BQML,
mean_squared_error
is a queryable field when evaluating your trained ML model. Add aSQRT()
to get RMSE.Then we can evaluate how well the model performs with this query using
ML.EVALUATE
.


Predict taxi fare amount
Now we can use the new model to make predictions.


It seems like the predictions are not good enough for us. Let’s improve it.
Improving the model with Feature Engineering
We can filter the training dataset by setting a possible range of fare_amount
. First, query the table to see if there are any incorrect values.


There are some strange outliers like negative fares or fares over $50000.
Let’s limit the data to only fares between $6 and $200.


Use the new dataset to train the model.

and get a better RMSE after then.


That’s all for this lab, next time we’ll learn about how to use BQML model type logistic_reg
with the article: