Databricks-Certified-Professional-Data-Engineer Sample Questions Answers

Questions 4

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

Options:

configure

jobs

libraries

workspace

Questions 5

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

Options:

“Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

“Manage” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

Questions 6

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

Use Repos to check out the dev-2.3.9 branch and auto-resolve conflicts with the current branch.

Merge all changes back to the main branch in the remote Git repository and clone the repo again.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository.

Questions 7

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

Whenever a database is being created, make sure that the LOCATION keyword is used.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

Whenever a table is being created, make sure that the LOCATION keyword is used.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Questions 8

A platform engineer is creating catalogs and schemas for the development team to use.

The engineer has created an initial catalog, catalog_A, and initial schema, schema_A. The engineer has also granted USE CATALOG, USE SCHEMA, and CREATE TABLE to the development team so that the engineer can begin populating the schema with new tables.

Despite being the owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in schema_A.

What explains the engineer's lack of access to the underlying tables?

Options:

The platform engineer needs to execute a REFRESH statement as the table permissions did not automatically update for owners.

Users granted with USE CATALOG can modify the owner's permissions to downstream tables.

The owner of the schema does not automatically have permission to tables within the schema, but can grant them to themselves at any point.

Permissions explicitly given by the table creator are the only way the platform engineer could access the underlying tables in their schema.

Answer:

Explanation:

In Databricks, catalogs, schemas (or databases), and tables are managed through the Unity Catalog or Hive Metastore, depending on the environment. Permissions and ownership within these structures are governed by access control lists (ACLs).

Catalog and Schema Ownership: When a platform engineer creates a catalog (such as catalog_A) and schema (such as schema_A), they automatically become the owner of those entities. This ownership gives them control over granting permissions for those entities (i.e., granting the USE CATALOG and USE SCHEMA privileges to others). However, ownership of the catalog or schema does not automatically extend to ownership or permission of individual tables within that schema.

Table Permissions: For tables within a schema, the permission model is more granular. The table creator (i.e., whoever creates the table) is automatically assigned as the owner of that table. In this case, the platform engineer owns the schema but does not automatically inherit permissions to any table created within the schema unless explicitly granted by the table's owner or unless they grant permissions to themselves.

Why the Engineer Lacks Access: The platform engineer notices that they do not have access to the underlying tables in schema_A despite being the owner of the schema. This occurs because the schema's ownership does not cascade to the tables. The engineer must either:

Grant permissions to themselves for the tables in schema_A, or

Be granted permissions by whoever created the tables within the schema.

Resolution: As the owner of the schema, the platform engineer can grant themselves the required permissions (such as SELECT or MODIFY) for the tables in the schema. This explains why the owner of a schema may not automatically have access to the tables and must take explicit steps to acquire those permissions.
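
The self-grant described in the resolution could be sketched as follows. This is only an illustration, assuming the `GRANT <privilege> ON TABLE <name> TO <principal>` form of Databricks SQL; the helper `build_self_grants`, the table names `orders` and `customers`, and the principal `platform_engineer@example.com` are hypothetical and do not come from the question.

```python
def build_self_grants(catalog, schema, tables, principal, privileges=("SELECT",)):
    """Build one GRANT statement per table and privilege (hypothetical helper)."""
    statements = []
    for table in tables:
        for privilege in privileges:
            statements.append(
                f"GRANT {privilege} ON TABLE {catalog}.{schema}.{table} "
                f"TO `{principal}`"
            )
    return statements


# catalog_A and schema_A come from the question; the table names and the
# principal are assumptions made up for this sketch.
grants = build_self_grants(
    "catalog_A",
    "schema_A",
    ["orders", "customers"],
    "platform_engineer@example.com",
    privileges=("SELECT", "MODIFY"),
)

for statement in grants:
    # In a Databricks notebook each statement would be run with spark.sql(statement).
    print(statement)
```

Building the statements as strings keeps the sketch runnable without a workspace; the schema owner would execute them once per table they need to read or modify.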

References

Databricks Unity Catalog Documentation: Manage Permissions

Databricks Permissions and Ownership: https://docs.databricks.com/security/access-control/workspace-acl.html#permissions

Questions 9

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

Options:

In the Executor's log file, by grepping for "predicate push-down"

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

In the Storage Detail screen, by noting which RDDs are not stored on disk

In the Delta Lake transaction log, by noting the column statistics

In the Query Detail screen, by interpreting the Physical Plan

Questions 10

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

Options:

Create a new table for the analytics team using a CTAS statement.

Deep clone the table for the analytics team.

Give the analytics team direct access to the production table.

Shallow clone the table for the analytics team.

Questions 11

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

Options:

They are merged.

They are ignored.

They are updated.

They are inserted.

They are deleted.

Questions 12

Which of the following is true of Delta Lake and the Lakehouse?

Options:

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Z-order can only be applied to numeric values stored in Delta Lake tables.

Answer:

Explanation:

https://docs.delta.io/2.0.0/table-properties.html

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters [1]. Data skipping is a performance optimization that avoids reading irrelevant data from the storage layer [1]. By collecting per-file statistics such as minimum and maximum values and null counts, Delta Lake can prune unnecessary files from the query plan [1]. This can significantly improve query performance and reduce I/O cost.
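
The pruning idea can be illustrated with a small pure-Python sketch. It is not Delta Lake's implementation: the `FileStats` class, the `files_to_scan` helper, and the file names and value ranges are all hypothetical, chosen only to show how per-file minimum and maximum values let a reader skip files that cannot match a range filter.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Min/max statistics for one column of one data file (illustrative only)."""
    path: str
    min_value: float
    max_value: float


def files_to_scan(files, lower, upper):
    """Keep files whose [min, max] range overlaps the open interval (lower, upper)."""
    return [f.path for f in files if f.max_value > lower and f.min_value < upper]


# Hypothetical files and value ranges for a single numeric column.
files = [
    FileStats("part-000.parquet", -180.0, -45.0),
    FileStats("part-001.parquet", -30.0, 10.0),
    FileStats("part-002.parquet", 15.0, 90.0),
    FileStats("part-003.parquet", 25.0, 170.0),
]

# Filter: value > -20 AND value < 20
selected = files_to_scan(files, -20, 20)
print(selected)  # ['part-001.parquet', 'part-002.parquet']
```

Only two of the four files overlap the filter range, so the other two never need to be opened. A file is kept when it might contain matching rows, not when it is guaranteed to.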

The other options are false because:

Parquet compresses data column by column, not row by row [2]. This allows for better compression ratios, especially for repeated or similar values within a column [2].

Views in the Lakehouse do not maintain a valid cache of the most recent versions of source tables at all times [3]. Views are logical constructs defined by a SQL query on one or more base tables [3]. They are not materialized by default: they store only the query definition, not data [3]. Therefore, views always reflect the latest state of the source tables when queried [3]. However, a view's results can be cached explicitly with CACHE TABLE or materialized into a table with CREATE TABLE AS SELECT.

Primary and foreign key constraints cannot be leveraged to ensure duplicate values are never entered into a dimension table. Delta Lake does not support enforcing primary and foreign key constraints on tables. Constraints are logical rules that define the integrity and validity of the data in a table. Delta Lake relies on the application logic or the user to ensure the data quality and consistency.

Z-order can be applied to any values stored in Delta Lake tables, not only numeric values. Z-order is a technique to optimize the layout of the data files by sorting them on one or more columns. Z-order can improve the query performance by clustering related values together and enabling more efficient data skipping. Z-order can be applied to any column that has a defined ordering, such as numeric, string, date, or boolean values.

References: [1] Data Skipping, [2] Parquet Format, [3] Views, [4] Caching, [5] Constraints, [6] Z-Ordering
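
The clustering effect attributed to Z-ordering above can be sketched with a classic Z-order (Morton) key, which interleaves the bits of two integer columns. This is a generic illustration of the curve, not a description of how Delta Lake computes its layout; the helper `interleave_bits` and the 4x4 grid of points are assumptions for the example.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y land on odd positions
    return key


# Hypothetical two-column data: every (x, y) pair on a 4x4 grid.
points = [(x, y) for x in range(4) for y in range(4)]

# Sorting by the interleaved key keeps points that are close in both
# dimensions close together in the sorted order.
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))
print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four points in the sorted order form one 2x2 corner of the grid, so a file holding them would have tight minimum and maximum values on both columns. That is what makes data skipping effective on either column.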

Questions 13

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

Options:

Size on Disk is > 0

The number of Cached Partitions > the number of Spark Partitions

The RDD Block Name included the '' annotation signaling failure to cache

On Heap Memory Usage is within 75% of off Heap Memory usage

Questions 14

A Delta Lake table with Change Data Feed (CDF) enabled in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?

Options:

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

Modify the overwrite logic to include a field populated by calling current_timestamp() as data are being written; use this field to identify records written on a particular date.

Replace the current overwrite logic with a MERGE statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the Change Data Feed.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

Questions 15

Which distribution does Databricks support for installing custom Python code packages?

Options:

sbt

CRAN

CRAM

nom

Wheels

jars

Questions 16

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Answer:

Explanation:

The scenario presented involves inconsistent microbatch processing times in a Structured Streaming job during peak hours, with the need to ensure that records are processed within 10 seconds. The trigger once option is the most suitable adjustment to address these challenges:

Understanding Triggering Options:

Fixed Interval Triggering (Current Setup): The current trigger interval of 10 seconds may contribute to the inconsistency during peak times as it doesn't adapt based on the processing time of the microbatches. If a batch takes longer to process, subsequent batches will start piling up, exacerbating the delays.

Trigger Once: This option allows the job to run a single microbatch for processing all available data and then stop. It is useful in scenarios where batch sizes are unpredictable and can vary significantly, which seems to be the case during peak hours in this scenario.

Implementation of Trigger Once:

Setup: Instead of continuously running, the job can be scheduled to run every 10 seconds using a Databricks job. This scheduling effectively acts as a custom trigger interval, ensuring that each execution cycle handles all available data up to that point without overlapping or queuing up additional executions.

Advantages: This approach allows for each batch to complete processing all available data before the next batch starts, ensuring consistency in handling data surges and preventing the system from being overwhelmed.

Rationale Against Other Options:

Option A and E (Decrease Interval): Decreasing the trigger interval to 5 seconds might exacerbate the problem by increasing the frequency of batch starts without ensuring the completion of previous batches, potentially leading to higher overhead and less efficient processing.

Option B (Increase Interval): Increasing the trigger interval to 30 seconds could lead to latency issues, as the data would be processed less frequently, which contradicts the requirement of processing records in less than 10 seconds.

Option C (Modify Partitions): While increasing parallelism through more shuffle partitions can improve performance, it does not address the fundamental issue of batch scheduling and could still lead to inconsistency during peak loads.

Conclusion:

By using the trigger once option and scheduling the job every 10 seconds, you ensure that each microbatch has sufficient time to process all available data thoroughly before the next cycle begins, aligning with the need to handle peak loads more predictably and efficiently.
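
The two trigger styles compared in this explanation can be sketched with PySpark's `DataStreamWriter.trigger`, which accepts `processingTime` and `once` keyword arguments. The helpers `trigger_kwargs` and `start_query`, and their parameter names, are hypothetical. `start_query` is only defined here, not called, because it needs a live Spark session and a streaming DataFrame.

```python
def trigger_kwargs(mode, interval_seconds=10):
    """Return keyword arguments for DataStreamWriter.trigger() (hypothetical helper)."""
    if mode == "once":
        # Process all available data in a single micro-batch, then stop.
        return {"once": True}
    if mode == "fixed":
        # Start a micro-batch at a fixed processing-time interval.
        return {"processingTime": f"{interval_seconds} seconds"}
    raise ValueError(f"unknown trigger mode: {mode!r}")


def start_query(streaming_df, checkpoint_path, target_table, mode):
    """Start a streaming write to a Delta table with the chosen trigger.

    Requires a running Spark session; checkpoint_path and target_table are
    placeholders supplied by the caller.
    """
    return (
        streaming_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(mode))
        .toTable(target_table)
    )


print(trigger_kwargs("fixed"))     # {'processingTime': '10 seconds'}
print(trigger_kwargs("fixed", 5))  # {'processingTime': '5 seconds'}
print(trigger_kwargs("once"))      # {'once': True}
```

With the `once` trigger the query stops after one batch, so repeated execution has to come from an external scheduler such as a Databricks job, as the explanation describes.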

References

Structured Streaming Programming Guide - Triggering

Databricks Jobs Scheduling

Questions 17

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customer table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

Options:

model.predict(df, columns)

df.map(lambda x: model(x[columns])).select("customer_id, predictions")

df.select("customer_id", model(*columns).alias("predictions"))

df.apply(model, columns).select("customer_id, prediction")

Questions 18

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark("event_time", "10 minutes")
    .groupBy(
        ________,
        "device_id"
    )
    .agg(
        avg("temp").alias("avg_temp"),
        avg("humidity").alias("avg_humidity")
    )
    .writeStream
    .format("delta")
    .saveAsTable("sensor_avg")

Which line of code correctly fills in the blank within the code block to complete this task?

Options:

window("event_time", "5 minutes").alias("time")

to_interval("event_time", "5 minutes").alias("time")

"event_time"

lag("event_time", "5 minutes").alias("time")

Questions 19

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Questions 20

Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?

Options:

spark.sql.files.maxPartitionBytes

spark.sql.autoBroadcastJoinThreshold

spark.sql.files.openCostInBytes

spark.sql.adaptive.coalescePartitions.minPartitionNum

spark.sql.adaptive.advisoryPartitionSizeInBytes

Questions 21

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

to_interval("event_time", "5 minutes").alias("time")

window("event_time", "5 minutes").alias("time")

"event_time"

window("event_time", "10 minutes").alias("time")

lag("event_time", "10 minutes").alias("time")

Questions 22

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

importlib.resource_path

sys.path

os.path

pypi.path

pylib.source

Questions 23

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.

Questions 24

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

date = spark.conf.get("date")

input_dict = input()
date = input_dict["date"]

import sys
date = sys.argv[1]

date = dbutils.notebooks.getParam("date")

dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

Questions 25

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?

Options:

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.

Questions 26

A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

Options:

Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command

Stop the existing pipeline; use the returned settings in a reset command

Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git

Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results, parse it, and use this to create a pipeline

Questions 27

Which statement describes Delta Lake Auto Compaction?

Options:

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Questions 28

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:

return spark.readStream.table("bronze")

return spark.readStream.load("bronze")

return spark.read.option("readChangeFeed", "true").table("bronze")

Questions 29

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

Options:

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

Z-order indices calculated on the table are preventing file compaction

Bloom filter indices calculated on the table are preventing file compaction

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Questions 30

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Options:

Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.

Questions 31

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Options:

Set the configuration delta.deduplicate = true.

VACUUM the Delta table after each batch completes.

Perform an insert-only merge with a matching condition on a unique key.

Perform a full outer join on a unique key and overwrite existing data.

Rely on Delta Lake schema enforcement to prevent duplicate records.

Questions 32

Which statement describes the default execution mode for Databricks Auto Loader?

Options:

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.

Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.

New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.

Questions 33

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

When a database is being created, make sure that the LOCATION keyword is used.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Questions 34

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

The Five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%

Questions 35

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

Task queueing resulting from improper thread pool assignment.

Spill resulting from attached volume storage being too small.

Network latency due to some cluster nodes being in different regions from the source data

Skew caused by more data being assigned to a subset of spark-partitions.

Credential validation errors while pulling data from an external system.

Questions 36

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Questions 37

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

Options:

DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.

By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.

The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.

Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.

The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

Questions 38

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

Options:

The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

The write will fail completely because of the constraint violation and no records will be inserted into the target table.

The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.
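Delta Lake writes are atomic transactions, and a CHECK constraint violation fails the transaction as a whole. The plain-Python sketch below models that all-or-nothing behavior. It is an illustration only: the constraint definition is not shown in this question, so the range check used here is a stand-in assumption, and the function and class names are hypothetical rather than part of any Delta Lake API.

```python
class ConstraintViolation(Exception):
    """Raised when any record in a batch fails the check."""


def atomic_insert(table, batch, check):
    """Append the whole batch to the table, or nothing at all."""
    # Validate every record before touching the table, so a failure
    # leaves the table exactly as it was.
    for record in batch:
        if not check(record):
            raise ConstraintViolation(f"record violates constraint: {record}")
    table.extend(batch)


def stand_in_check(record):
    # ASSUMPTION: the real constraint is not shown; this range check is a stand-in.
    return -90 <= record["latitude"] <= 90 and -180 <= record["longitude"] <= 180


table = [{"latitude": 10.0, "longitude": 20.0}]
batch = [
    {"latitude": 1.0, "longitude": 2.0},
    {"latitude": 45.50, "longitude": 212.67},  # values taken from the question
]

try:
    atomic_insert(table, batch, stand_in_check)
    failed = False
except ConstraintViolation:
    failed = True

print(failed, len(table))
```

Under the stand-in check, the second record is rejected and the first record of the batch is not written either, so the table keeps its original single row.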

Questions 39

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

Options:

Stage’s detail screen and Executor’s files

Stage’s detail screen and Query’s detail screen

Driver’s and Executor’s log files

Executor’s detail screen and Executor’s log files

Questions 40

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

Options:

The job_id and number of times the job has been run are concatenated and returned.

The globally unique ID of the newly triggered run.

The job_id is returned in this field.

The number of times the job definition has been run in this workspace.

Exam Code: Databricks-Certified-Professional-Data-Engineer

Exam Name: Databricks Certified Data Engineer Professional Exam

Last Update: Oct 15, 2025

Questions: 195
