How I passed the GCP Data Engineer exam - Key points to study and personal experience
It’s a pretty honest and intense exam. It was a challenge I had in mind for a while, and after the end of the year I finally encouraged myself to take it. Here’s the result:


The idea of this post is to share some key points, notes, and study materials for the exam.
There were some topics that I didn’t find in other posts related to the Data Engineer certification, so that’s the idea of this one. Take it as a complementary note for your cheat sheet, course, etc.
First of all, I work as a Data Architect and I have almost 8 months of experience with GCP, which was really helpful for some questions. The answers are really tricky and you can doubt yourself on almost every question, so the idea is to stay calm, mark the most reasonable option, and review the question later. The main thing is not to get stuck on a single question; remember that time is limited. It took me almost an hour and a half to complete the test (you have 2 hours).
Machine Learning
The exam (questions are randomized, so I can’t speak for other people’s experiences) was really heavy on Machine Learning topics. There were about 15 questions on this, mostly related to:
- Overfitted models
- How to deal with a high RMSE, for example by making your model more complex and robust.
- Neurons, features, epochs, labels.
- L1 and L2 regularization (see the sketch after this list)
- Dialogflow
- Cloud AutoML to label some logos within an image.
- Cloud Vision API, Speech to Text, etc.
- TensorFlow models in C++
- Cloud TPU and GPU
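Since L1 vs. L2 regularization comes up repeatedly, here is a minimal Keras sketch of both; the layer sizes and input shape are made up for illustration.

```python
# A minimal sketch of L1 and L2 regularization in Keras.
# Model architecture and shapes are hypothetical.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu", input_shape=(10,),
        # L2 penalizes large weights, shrinking them smoothly toward zero.
        kernel_regularizer=tf.keras.regularizers.l2(0.01),
    ),
    tf.keras.layers.Dense(
        32, activation="relu",
        # L1 pushes weights exactly to zero, effectively pruning features.
        kernel_regularizer=tf.keras.regularizers.l1(0.01),
    ),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```

Both techniques combat overfitting; which one a question expects usually depends on whether it describes sparse weights (L1) or small weights (L2).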
BigQuery
The exam was heavy on BigQuery questions too: updates in BigQuery, IAM roles, slots, storage, etc.
- Update DML: how to use UPDATE DML statements in BigQuery and how to handle “quota exceeded” errors in your project. How many updates can you run on a daily basis? Best ways to update a table (for example, a partitioned table).
- Authorized views: I recommend studying everything about authorized views and how you can share query results with your data science team without giving them access to every column of a table.
- Allocated slots and available slots: what can you do if you don’t have any more slots available and you don’t want to create a new project in your organization?
- Storage: questions related to the best way to store raw data, for example choosing between BigQuery and Cloud Storage. This will depend on the context of the question and whether price or performance is the priority.
- BigQuery Data Transfer Service and the connection available with BI tools.
- Partitioning and clustering (see the sketch after this list).
- HASH, MERGE, and data manipulation. There were 2 questions related to this.
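Partitioned, clustered tables and UPDATE DML come up together, so here is a hedged sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical.

```python
# A sketch of creating a partitioned, clustered table and updating one
# partition with DML. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("status", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
table.clustering_fields = ["user_id"]
client.create_table(table)

# Filtering the UPDATE to a single partition limits the bytes scanned,
# which also helps when you are bumping into DML quotas.
client.query(
    """
    UPDATE `my-project.analytics.events`
    SET status = 'processed'
    WHERE DATE(event_ts) = '2020-01-01' AND status = 'pending'
    """
).result()
```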
Bigtable
I have no professional experience working with Bigtable, so my knowledge was mostly theoretical.
- There were 2 questions related to row key performance and how you can update your cluster if performance is not optimal due to heavy reads or writes (see the sketch after this list).
- How you can scale your cluster and synchronize the data.
- Single-cluster routing and multi-cluster routing.
- Key Visualizer Metrics
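Row key design is the heart of the Bigtable questions, so here is a hedged sketch of writing a row with a well-distributed key using the google-cloud-bigtable client; the instance, table, and column family names are hypothetical.

```python
# A sketch of Bigtable row key design. All names are hypothetical.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("metrics")

# A key that starts with a timestamp ("2020-01-01#sensor42") is sequential,
# so all writes hot-spot on one node. Leading with a high-cardinality field
# such as the device id spreads the load across the cluster.
row_key = b"sensor42#2020-01-01T12:00:00"
row = table.direct_row(row_key)
row.set_cell("readings", "temp", b"21.5")
row.commit()
```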
Spanner
Knowing about default indexes and secondary indexes is a must, as is knowing when to choose Spanner over Datastore, Bigtable, or Cloud SQL (see the sketch after this list).
- Regional configuration and replicas.
- Monitoring CPU utilization
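For the secondary index point, a minimal sketch with the google-cloud-spanner client; the instance, database, table, and index names are hypothetical.

```python
# A sketch of adding a secondary index in Spanner. By default, Spanner only
# indexes the primary key, so other lookup columns need explicit indexes.
# All names are hypothetical.
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-db")

# Schema changes are long-running operations; wait for completion.
operation = database.update_ddl(
    ["CREATE INDEX SingersByLastName ON Singers(LastName)"]
)
operation.result()
```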
Cloud SQL
I remember just one question about Cloud SQL. I recommend knowing how to export data from Cloud SQL to BigQuery, migrate on-premises databases to Cloud SQL, and set up Cloud SQL HA and read replicas.
Datastore
Again, the key here is when to choose Datastore over other databases like Cloud SQL, Bigtable, BigQuery, etc.
- How you can export data from Datastore to BigQuery.
- Replicating data between projects.
- Multiple indexes and the syntax to create composite indexes are going to be really helpful (see the sketch after this list).
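As a reminder of why composite indexes matter, here is a hedged google-cloud-datastore sketch of a query that only works once a composite index is deployed; the kind and property names are hypothetical, and the add_filter call shown uses the older positional signature.

```python
# A sketch of a Datastore query that requires a composite index:
# filtering on one property while sorting on another. The index itself
# is declared in index.yaml and deployed separately.
# Kind and property names are hypothetical.
from google.cloud import datastore

client = datastore.Client(project="my-project")

query = client.query(kind="Task")
query.add_filter("done", "=", False)
query.order = ["-created"]  # needs a composite index on (done, created)

results = list(query.fetch(limit=10))
```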
Dataflow
This topic was really technical, and if you don’t have experience working with Dataflow it may be a little bit tricky.
- How to discard erroneous data and, for example, send it to Pub/Sub or Cloud Storage (see the sketch after this list).
- Transforms, DoFn, side inputs, side outputs.
- IAM Roles for Developers and how to secure the data.
- Windowing, all kinds of it. There were 3 questions about sliding windows, session windows, and the best way to deal with late data.
- Bounded and unbounded data.
- How to connect with Pub/Sub, BigQuery, BigTable, etc.
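The “discard erroneous data” item is the classic dead-letter pattern; here is a hedged Apache Beam sketch, with a made-up element format and print statements standing in for the real Pub/Sub or Cloud Storage sink.

```python
# A sketch of the dead-letter pattern in Apache Beam: a DoFn routes records
# that fail parsing to a tagged side output instead of crashing the pipeline.
import json

import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class ParseEvent(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except ValueError:
            # Bad records go to a separate output for later inspection.
            yield TaggedOutput("dead_letter", element)

with beam.Pipeline() as pipeline:
    results = (
        pipeline
        | beam.Create(['{"id": 1}', "not-json"])
        | beam.ParDo(ParseEvent()).with_outputs("dead_letter", main="parsed")
    )
    results.parsed | "good" >> beam.Map(print)
    # In a real pipeline this branch would write to Cloud Storage or Pub/Sub.
    results.dead_letter | "bad" >> beam.Map(print)
```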
Pub/Sub
I recommend understanding the difference between push and pull and what you need to implement a push solution.
This service is the glue between other Cloud components, so there were some questions related to Pub/Sub / Dataflow / BigQuery implementations.
- Streaming and how to implement this solution with Dataflow
- Globally Unique Identifier (GUID); see the sketch after this list
- Handle subscriber code errors
- How to connect Kafka to Pub/Sub
- How to know when your topic is currently not working well. This is mostly related to Stackdriver Monitoring.
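The GUID point is about de-duplication: Pub/Sub delivers at least once, so publishers attach their own unique identifier for downstream consumers to de-duplicate on. A minimal sketch; the project, topic, and attribute names are hypothetical.

```python
# A sketch of attaching a GUID to each Pub/Sub message as a custom attribute
# (distinct from the server-assigned message ID). Names are hypothetical.
import uuid

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")

future = publisher.publish(
    topic_path,
    b"payload",
    unique_id=str(uuid.uuid4()),  # consumers can de-duplicate on this
)
print(future.result())  # server-assigned message ID
```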
Dataproc
This was pretty heavy on on-prem Hadoop implementations and how to migrate to GCP.
- Migrate jobs to the cloud.
- Which role the service account needs to work properly with Dataproc (Dataproc Worker).
- SOCKS proxy and the YARN web interface.
- Custom images.
- Use Cloud Storage instead of HDFS (see the sketch after this list).
- Always remember that Google recommends one cluster for one task. If you need analytics and transactional solutions with Dataproc, it’s better to create two clusters for that kind of implementation.
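On the Storage-instead-of-HDFS point, a tiny PySpark sketch: with the GCS connector that ships on Dataproc clusters, jobs read and write gs:// paths just like HDFS ones, so clusters can stay ephemeral. The bucket and file names are hypothetical.

```python
# A sketch of a Dataproc PySpark job using Cloud Storage instead of HDFS.
# Bucket and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

df = spark.read.csv("gs://my-bucket/raw/events.csv", header=True)
df.groupBy("status").count().write.parquet("gs://my-bucket/output/")
```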
IAM Roles
It’s going to be useful to know the most important roles for every service.
- The difference between the jobUser and user roles for BigQuery.
- Dataproc Worker, Dataflow Developer.
- The Billing Administrator role.
- The difference between the Writer and Reader roles for BigQuery.
- Which roles can you grant for the Pub/Sub service?
- Aggregated logs for multiple projects.
- Dealing with roles across projects: what are the best practices for that? Creating a group of users for those projects? Using the resource hierarchy? Service accounts for Cloud Storage and BigQuery?
Cloud Storage
Here, it’s a must to know the differences between the storage classes. The exam questions were really tricky about choosing between Coldline and Nearline implementations (see the sketch after this list).
- Most questions were related to how to secure raw data for audit.
- Data Transfer vs Storage transfer service.
- How you can stay in sync with on-premises storage if it doesn’t allow connections from outside IPs.
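Lifecycle rules are the usual mechanism behind the Nearline/Coldline trade-off questions, so here is a hedged google-cloud-storage sketch; the bucket name and age thresholds are hypothetical.

```python
# A sketch of lifecycle rules that demote objects to cheaper storage classes
# as they age. Bucket name and thresholds are hypothetical.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.get_bucket("my-bucket")

bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)  # after 30 days
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)  # after 90 days
bucket.patch()  # push the updated lifecycle configuration to GCS
```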
Composer
This cloud component is quite easy to figure out. Remember that a Cloud Composer environment runs Airflow, and Airflow itself is an orchestration tool. So when you want to integrate Dataflow jobs with Dataproc jobs and there are dependencies between them, the best solution is always going to be Cloud Composer, as in the sketch below.
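A hedged Airflow sketch of that dependency case. Real DAGs would use the Dataflow and Dataproc operators from the Google provider package; BashOperator stands in here so the example is self-contained, and all IDs and commands are hypothetical.

```python
# A sketch of a Composer/Airflow DAG where a Dataproc job depends on a
# Dataflow job finishing first. Task commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "etl_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    dataflow_job = BashOperator(task_id="run_dataflow", bash_command="echo dataflow")
    dataproc_job = BashOperator(task_id="run_dataproc", bash_command="echo dataproc")

    dataflow_job >> dataproc_job  # Dataproc waits for Dataflow to succeed
```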
DataStudio
Study the difference between Viewer credentials and Owner credentials if you want to share some dashboards.
- Default caching and prefetch caching.
- How to connect BigQuery with DataStudio and other services like YouTube.
Dataprep
There were some questions related to Dataprep. For example: if you want a quite easy implementation to deal with outliers, what’s the best tool for that? Transform recipes. And finally, how to schedule a Dataprep flow: do you need Cloud Scheduler for that, or can you do it directly from the Dataprep UI?
Tips
Besides all the key points mentioned, I found this exam heavily focused on Machine Learning, so I recommend learning the most important vocabulary for that, like labels, epochs, neurons, hidden layers, bias, and weights, and learning when to implement a linear regression model instead of a classification or clustering one.
It’s always good to read the documentation and stay updated on every new feature added to these Cloud services. For example, at the moment you can deploy TensorFlow, scikit-learn, and XGBoost models to AI Platform, but who knows what’s coming in the future; the same applies to Machine Learning models in BigQuery.
Another good key point is to know about Stackdriver Monitoring, Stackdriver Logging, the Stackdriver Logging agent, etc. I remember there was a question about installing the Stackdriver Logging agent for a MariaDB instance on Compute Engine, so that can be helpful too.
Courses
Coursera - 5/10: If you don't have experience with GCP, start with this course. It’s a really good introduction to GCP but IMO doesn't prepare you for the exam.
Linux Academy - 7/10: This course is focused on preparing you for the exam, but it doesn’t go too deep into every service related to the exam. Still, IMO it’s a must to take this course and then read the documentation.
Documentation - 8/10: The official documentation really helped me to go deeper into some topics, like cluster configuration for Bigtable, Machine Learning models for BigQuery, Dataproc, Dataflow, etc. I recommend giving yourself time to read the documentation; it’s quite heavy and it’s going to take a while, but the payoff is big.
Practice exams
BrainCert - 8/10: I would say these practice exams are the best out there. The questions really prepare you for the real exam, and you’ll gain more knowledge after reading the explanation for each question and answer. I did two tests almost every day for about 2 weeks until I was scoring between 94% and 100% on every test.
I hope that this post will be helpful for all of you that are going to take the exam in the future. Please let me know if you have any questions!
You can always reach me on:
- Twitter: https://twitter.com/StriderKeni
- Github: https://github.com/StriderKeni