Get up-to-date Real Exam Questions for Professional-Data-Engineer UPDATED [2023]
Pass Google Professional-Data-Engineer Exam in First Attempt Guaranteed
Google Cloud Big Data & Machine Learning Fundamentals course
This course introduces Google Cloud's big data and machine learning capabilities. To complete the training successfully, however, you should have about one year of experience in SQL; extract, transform, and load (ETL) activities; data modeling; machine learning; and programming in Python. The objectives of the course are the following:
- Utilize Cloud SQL & Dataproc to migrate existing MySQL, Pig, Spark, or Hive workloads to Google Cloud
- Employ BigQuery and Cloud SQL for interactive data analysis
- Create ML models using BigQuery ML, APIs, and AutoML.
- Recognize the purpose of the key big data and machine learning products in Google Cloud
NEW QUESTION 27
Google Cloud Bigtable indexes a single value in each row. This value is called the _______.
- A. master key
- B. unique key
- C. primary key
- D. row key
Answer: D
Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
Reference: https://cloud.google.com/bigtable/docs/overview
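The idea of a single indexed value per row can be illustrated with a toy model. The sketch below is not Bigtable code; `TinyTable` and its methods are hypothetical names. It assumes only that rows are kept sorted by row key and that the row key is the sole index, so lookups and prefix scans go through the key.

```python
import bisect


class TinyTable:
    """Toy table whose only index is the row key, kept in sorted order."""

    def __init__(self):
        self._keys = []   # sorted row keys (the single index)
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, columns):
        # Insert the key into the sorted index the first time it is seen.
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        # Point lookup by row key; no other column is indexed.
        return self._rows.get(row_key)

    def scan_prefix(self, prefix):
        # Range scan: jump to the first key >= prefix, stop when the prefix ends.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            matches.append(key)
        return matches


table = TinyTable()
table.put("user#002", {"name": "Bo"})
table.put("user#001", {"name": "Al"})
table.put("video#001", {"title": "Intro"})
print(table.scan_prefix("user#"))
```

Finding a row by any value other than the row key would require scanning every row in this model, which is why row key design matters so much.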
NEW QUESTION 28
Which software libraries are supported by Cloud Machine Learning Engine?
- A. TensorFlow and Torch
- B. TensorFlow
- C. Theano and TensorFlow
- D. Theano and Torch
Answer: B
Explanation:
Cloud ML Engine mainly does two things:
- Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud.
- Hosts those trained models for you in the cloud so that you can use them to get predictions about new data.
NEW QUESTION 29
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You need to compose visualizations for operations teams with the following requirements:
* The report must include telemetry data from all 50,000 installations for the most recent 6 weeks (sampling once every minute).
* The report must not be more than 3 hours delayed from live data.
* The actionable report should only show suboptimal links.
* Most suboptimal links should be sorted to the top.
* Suboptimal links can be grouped and filtered by regional geography.
* User response time to load the report must be <5 seconds.
Which approach meets the requirements?
- A. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.
- B. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.
- C. Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.
- D. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.
Answer: C
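The data volume implied by the reporting requirements can be estimated with simple arithmetic. The sketch below uses only the figures stated in the question (50,000 installations, 6 weeks, one sample per minute); the function name is a hypothetical helper.

```python
def telemetry_rows(installations, weeks, samples_per_minute=1):
    """Rows produced when every installation reports at a fixed rate."""
    minutes = weeks * 7 * 24 * 60  # 6 weeks -> 60,480 minutes
    return installations * minutes * samples_per_minute


rows = telemetry_rows(50_000, 6)
print(f"{rows:,} rows")  # 3,024,000,000 rows
```

Roughly three billion rows must be filtered and sorted within the 5-second response target, which rules out approaches that read every row into a spreadsheet or application tier.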
NEW QUESTION 30
You are developing an application that uses a recommendation engine on Google Cloud. Your solution should display new videos to customers based on past views. Your solution needs to generate labels for the entities in videos that the customer has viewed. Your design must be able to provide very fast filtering suggestions based on data from other customer preferences on several TB of data. What should you do?
- A. Build and train a classification model with Spark MLlib to generate labels. Build and train a second classification model with Spark MLlib to filter results to match customer preferences. Deploy the models using Cloud Dataproc. Call the models from your application.
- B. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user's viewing history to generate preferences.
- C. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud SQL, and join and filter the predicted labels to match the user's viewing history to generate preferences.
- D. Build and train a complex classification model with Spark MLlib to generate labels and filter the results. Deploy the models using Cloud Dataproc. Call the model from your application.
Answer: B
NEW QUESTION 31
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks.
She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?
- A. Run a local version of Jupyter on the laptop.
- B. Host a visualization tool on a VM on Google Compute Engine.
- C. Grant the user access to Google Cloud Shell.
- D. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
Answer: D
NEW QUESTION 32
Which of these sources can you not load data into BigQuery from?
- A. Google Cloud SQL
- B. File upload
- C. Google Drive
- D. Google Cloud Storage
Answer: A
Explanation:
You can load data into BigQuery from a file upload, Google Cloud Storage, Google Drive, or Google Cloud Bigtable. It is not possible to load data into BigQuery directly from Google Cloud SQL. One way to get data from Cloud SQL to BigQuery would be to export data from Cloud SQL to Cloud Storage and then load it from there.
NEW QUESTION 33
You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?
- A. Cloud Bigtable
- B. Cloud SQL for PostgreSQL
- C. BigQuery
- D. Cloud Datastore
Answer: C
NEW QUESTION 34
What is the recommended way to switch between SSD and HDD storage for your Google Cloud Bigtable instance?
- A. The selection is final and you must resume using the same storage type
- B. Export the data from the existing instance and import the data into a new instance
- C. Run parallel instances where one is HDD and the other is SSD
- D. Create a third instance and sync the data from the two storage types via batch jobs
Answer: B
Explanation:
When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage for the cluster is permanent. You cannot use the Google Cloud Platform Console to change the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice-versa, you can export the data from the existing instance and import the data into a new instance. Alternatively, you can write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.
NEW QUESTION 35
You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?
- A. Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
- B. Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
- C. Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
- D. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
Answer: D
Explanation:
Google recommends that enterprises use Transfer Appliance in cases where it would take them over a week to upload data to the cloud via the internet, or when an enterprise needs to migrate over 60 TB of data.
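The one-week threshold in the explanation can be checked against the 10 TB in the question with a rough transfer-time estimate. This is a back-of-the-envelope sketch: the bandwidth figures are assumptions chosen for illustration (the question does not state a link speed), decimal terabytes are assumed, and `upload_days` is a hypothetical helper.

```python
def upload_days(size_tb, bandwidth_mbps, efficiency=1.0):
    """Estimated days to upload size_tb terabytes over a link of bandwidth_mbps.

    efficiency is the assumed fraction of the link actually usable (0 < efficiency <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8                              # decimal TB -> bits
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)   # Mbps -> bits per second
    return seconds / 86_400                                # seconds -> days


for mbps in (100, 1_000):  # assumed link speeds
    print(f"10 TB at {mbps} Mbps: {upload_days(10, mbps):.1f} days")
```

Under these assumptions a 1 Gbps link moves 10 TB in under a day, well below the one-week mark, while a 100 Mbps link takes a little over nine days.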
NEW QUESTION 36
When using Cloud Dataproc clusters, you can access the YARN web interface by configuring a browser to connect through a ____ proxy.
- A. SOCKS
- B. VPN
- C. HTTP
- D. HTTPS
Answer: A
Explanation:
When using Cloud Dataproc clusters, configure your browser to use the SOCKS proxy. The SOCKS proxy routes data intended for the Cloud Dataproc cluster through an SSH tunnel.
Reference: https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces#interfaces
NEW QUESTION 37
Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM
  bigquery-public-data.noaa_gsod.gsod
WHERE
  age != 99 AND _TABLE_SUFFIX = '1929'
ORDER BY
  age DESC
Which table name will make the SQL statement work correctly?
- A. bigquery-public-data.noaa_gsod.gsod*
- B. `bigquery-public-data.noaa_gsod.gsod'*
- C. `bigquery-public-data.noaa_gsod.gsod*`
- D. `bigquery-public-data.noaa_gsod.gsod`
Answer: C
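The syntax error comes from the unquoted hyphens in the project name. In BigQuery standard SQL, a wildcard table name is written inside backticks, with the `*` inside the quotes. The sketch below only builds the query text; `wildcard_query` is a hypothetical helper, and plain string formatting is used for illustration only (real applications should pass values such as the suffix as query parameters).

```python
def wildcard_query(project, dataset, table_prefix, suffix):
    """Build a standard SQL query over a wildcard table (illustrative only)."""
    # Backticks must enclose the whole path, including the trailing '*'.
    table = f"`{project}.{dataset}.{table_prefix}*`"
    return (
        "SELECT age\n"
        f"FROM {table}\n"
        f"WHERE age != 99 AND _TABLE_SUFFIX = '{suffix}'\n"
        "ORDER BY age DESC"
    )


print(wildcard_query("bigquery-public-data", "noaa_gsod", "gsod", "1929"))
```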
NEW QUESTION 38
When you design a Google Cloud Bigtable schema it is recommended that you _________.
- A. Avoid schema designs that are based on NoSQL concepts
- B. Create schema designs that require atomicity across rows
- C. Avoid schema designs that require atomicity across rows
- D. Create schema designs that are based on a relational database design
Answer: C
Explanation:
All operations are atomic at the row level. For example, if you update two rows in a table, it's possible that one row will be updated successfully and the other update will fail. Avoid schema designs that require atomicity across rows.
Reference: https://cloud.google.com/bigtable/docs/schema-design#row-keys
NEW QUESTION 39
Which of these is NOT a way to customize the software on Dataproc cluster instances?
- A. Modify configuration files using cluster properties
- B. Log into the master node and make changes from there
- C. Set initialization actions
- D. Configure the cluster using Cloud Deployment Manager
Answer: D
NEW QUESTION 40
How would you query specific partitions in a BigQuery table?
- A. Use the _PARTITIONTIME pseudo-column in the WHERE clause
- B. Use DATE BETWEEN in the WHERE clause
- C. Use the EXTRACT(DAY) clause
- D. Use the DAY column in the WHERE clause
Answer: A
Explanation:
Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. To limit a query to particular partitions (such as Jan 1st and 2nd of 2017), use a clause similar to this:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables#the_partitiontime_pseudo_column
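The effect of filtering on the partition pseudo-column can be pictured with a toy model in which each daily partition is stored separately and only partitions inside the requested range are read. This is a sketch, not BigQuery code; `PARTITIONS` and `query_partitions` are hypothetical names and the rows are made up.

```python
from datetime import date

# Hypothetical table: one list of rows per daily partition.
PARTITIONS = {
    date(2016, 12, 31): ["r1", "r2"],
    date(2017, 1, 1): ["r3"],
    date(2017, 1, 2): ["r4", "r5"],
    date(2017, 1, 3): ["r6"],
}


def query_partitions(partitions, start, end):
    """Read only the partitions whose date falls in [start, end] (inclusive, like BETWEEN)."""
    scanned = [day for day in sorted(partitions) if start <= day <= end]
    rows = [row for day in scanned for row in partitions[day]]
    return scanned, rows


scanned, rows = query_partitions(PARTITIONS, date(2017, 1, 1), date(2017, 1, 2))
print(len(scanned), "of", len(PARTITIONS), "partitions scanned")
```

Partitions outside the range are never touched, which is what reduces both the data scanned and the cost of the query.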
NEW QUESTION 41
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time.
This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
- A. Tier older data onto Cloud Storage files, and leverage extended tables.
- B. Implement clustering in BigQuery on the ingest date column.
- C. Re-create the table using data partitioning on the package delivery date.
- D. Implement clustering in BigQuery on the package-tracking ID column.
Answer: D
NEW QUESTION 42
You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?
- A. Increase the cluster size with more non-preemptible workers.
- B. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
- C. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
- D. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
Answer: B
Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex
NEW QUESTION 43
Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?
- A. A sequential numeric ID
- B. A stock symbol followed by a timestamp
- C. A non-sequential numeric ID
- D. A timestamp followed by a stock symbol
Answer: A,D
Explanation:
Using a timestamp as the first element of a row key can cause a variety of problems.
In brief, when a row key for a time series includes a timestamp, all of your writes will target a single node; fill that node; and then move onto the next node in the cluster, resulting in hotspotting.
Suppose your system assigns a numeric ID to each of your application's users. You might be tempted to use the user's numeric ID as the row key for your table. However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes. [https://cloud.google.com/bigtable/docs/schema-design]
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
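The hotspotting effect can be reproduced in a small simulation. The sketch below is a simplified model, not Bigtable's actual tablet-splitting logic: it assumes the sorted key space is split into equal contiguous ranges, one per node, and then counts which nodes receive the writes for the most recent timestamp. The symbols, timestamps, and node count are made up.

```python
from collections import Counter

SYMBOLS = ["AAPL", "GOOG", "IBM", "MSFT", "ORCL", "TSLA"]
TIMESTAMPS = range(1000, 1010)  # ten consecutive ticks, all four digits wide
NODES = 3


def node_hits_for_latest_tick(make_key):
    """Count, per node, the writes belonging to the most recent timestamp."""
    keys = sorted(make_key(s, t) for s in SYMBOLS for t in TIMESTAMPS)
    per_node = len(keys) // NODES  # equal contiguous key ranges per node
    owner = {key: min(i // per_node, NODES - 1) for i, key in enumerate(keys)}
    latest = max(TIMESTAMPS)
    return Counter(owner[make_key(s, latest)] for s in SYMBOLS)


ts_first = node_hits_for_latest_tick(lambda s, t: f"{t}#{s}")
sym_first = node_hits_for_latest_tick(lambda s, t: f"{s}#{t}")
print("timestamp first:", dict(ts_first))   # every new write lands on one node
print("symbol first:   ", dict(sym_first))  # new writes spread across nodes
```

With the timestamp leading the key, all six of the newest writes fall in the last key range; with the symbol leading, they are spread evenly over the three ranges.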
NEW QUESTION 44
You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization. Which two actions can you take to increase performance of your pipeline? (Choose two.)
- A. Change the zone of your Cloud Dataflow pipeline to run in us-central1
- B. Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery
- C. Increase the number of max workers
- D. Use a larger instance type for your Cloud Dataflow workers
- E. Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery
Answer: C,D
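When workers are pinned at maximum CPU, the worker cap and the machine type are the settings that add compute. The sketch below assembles command-line arguments for a pipeline launch; the flag names follow the Apache Beam Python SDK's pipeline options and should be verified against the SDK version in use. The helper name and the values 10 and n1-standard-4 are assumptions for illustration.

```python
def dataflow_args(max_workers, machine_type, region="europe-west4"):
    """Build pipeline arguments that raise the worker cap and machine size."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [
        "--runner=DataflowRunner",
        f"--region={region}",                     # keep processing in the same region
        f"--max_num_workers={max_workers}",       # upper bound for autoscaling
        f"--worker_machine_type={machine_type}",  # more CPU and memory per worker
    ]


# Hypothetical values: up from 3 x n1-standard-1 in the question.
print(" ".join(dataflow_args(10, "n1-standard-4")))
```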
NEW QUESTION 45
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)
- A. Load data into a different dataset for each client.
- B. Only allow a service account to access the datasets.
- C. Load data into different partitions.
- D. Put each client's BigQuery dataset into a different table.
- E. Use the appropriate identity and access management (IAM) roles for each client's users.
- F. Restrict a client's dataset to approved users.
Answer: A,E,F
NEW QUESTION 46
Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first. What should you do?
- A. Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
- B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
- C. Create a file on a shared file and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
- D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
Answer: A
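Whichever transport collects the events, once they are in one place the earliest bid can be decided from the timestamp carried in each event. The sketch below is a minimal illustration of that comparison, using the four fields named in the question; the `Bid` class, the `first_bidders` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    item: str
    amount: float
    user: str
    timestamp: float  # event time recorded when the bid was placed


def first_bidders(events):
    """For each (item, amount) pair, keep the bid with the earliest timestamp."""
    winners = {}
    for bid in events:
        key = (bid.item, bid.amount)
        if key not in winners or bid.timestamp < winners[key].timestamp:
            winners[key] = bid
    return winners


events = [
    Bid("lamp", 20.0, "bob", 1700000000.250),
    Bid("lamp", 20.0, "alice", 1700000000.120),  # arrives later but was placed earlier
]
print(first_bidders(events)[("lamp", 20.0)].user)
```

Comparing event timestamps rather than arrival order matters because events from different application servers can arrive out of order.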
NEW QUESTION 47
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems.
Which solution should you choose?
- A. Dialogflow Enterprise Edition
- B. Cloud AutoML Natural Language
- C. Cloud Speech-to-Text API
- D. Cloud Natural Language API
Answer: A
NEW QUESTION 48
You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm. To do this you need to add a synthetic feature. What should the value of that feature be?
- A. Y^2
- B. X^2
- C. cos(X)
- D. X^2+Y^2
Answer: C
Google Professional-Data-Engineer Study Guide Archives : https://actualtests.vceengine.com/Professional-Data-Engineer-vce-test-engine.html
