Databricks-Certified-Data-Engineer-Associate Sample Questions Answers

Questions 4

Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

Options:

Questions 5

A global retail company sells products across multiple categories (e.g., Electronics, Clothing) and regions (e.g., North, South, East, West). The sales team has provided the data engineer with a PySpark DataFrame named sales_df as below, and the team wants the data engineer to analyze the sales data to help them make strategic decisions.

Options:

Category_sales = sales_df.groupBy("category").agg(sum("sales_amount").alias("total_sales_amount"))

Category_sales = sales_df.sum("sales_amount").groupBy("category").alias("total_sales_amount")

Category_sales = sales_df.agg(sum("sales_amount").groupBy("category").alias("total_sales_amount"))

Category_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))
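
For intuition about what a per-category total involves, here is a minimal plain-Python sketch (not PySpark) of a group-by followed by a sum. The sample rows and the helper `total_by_key` are hypothetical; only the column names `category`, `region`, and `sales_amount` come from the options above.

```python
from collections import defaultdict

# Hypothetical sample rows; the real sales_df is not shown in the question.
rows = [
    {"category": "Electronics", "region": "North", "sales_amount": 100.0},
    {"category": "Clothing", "region": "South", "sales_amount": 40.0},
    {"category": "Electronics", "region": "East", "sales_amount": 60.0},
]


def total_by_key(records, key, value):
    """Sum `value` for each distinct `key`, like a group-by followed by a sum."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


# Grouping by "category" gives one total per category; grouping by "region"
# would answer a different question.
category_totals = total_by_key(rows, "category", "sales_amount")
print(category_totals)
```

The grouping column decides which question is answered, so a grouped sum over `region` is syntactically valid but does not produce category totals.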

Questions 6

A data engineer is inspecting an ETL pipeline based on a PySpark job that consistently encounters performance bottlenecks. Based on developer feedback, the data engineer assumes the job is low on compute resources. To pinpoint the issue, the data engineer observes the Spark UI and finds that the job has a high CPU time vs. Task time.

Which course of action should the data engineer take?

Options:

High CPU time vs Task time means an under-utilized cluster. The data engineer may need to repartition data to spread the jobs more evenly throughout the cluster.

High CPU time vs Task time means efficient use of cluster and no change needed

High CPU time vs Task time means over-utilized memory and the need to increase parallelism

High CPU time vs Task time means a CPU over-utilized job. The data engineer may need to consider executor and core tuning or resizing the cluster

Questions 7

A data engineer wants to reduce costs and optimize cloud spending. The data engineer has decided to use Databricks Serverless for lowering cloud costs while maintaining existing SLAs.

What is the first step in migrating to Databricks Serverless?

Options:

Legacy ingestion pipelines that include ingestion from sources such as APIs, files, and JDBC/ODBC connections

Low-frequency BI dashboarding and ad hoc SQL analytics

A frequently running and efficient Python-based data transformation pipeline compatible with the latest Databricks runtime and Unity Catalog

A frequently running and efficient Scala-based data transformation pipeline compatible with the latest Databricks runtime and Unity Catalog

Questions 8

Which of the following data workloads will utilize a Gold table as its source?

Options:

A job that enriches data by parsing its timestamps into a human-readable format

A job that aggregates uncleaned data to create standard summary statistics

A job that cleans data by removing malformatted records

A job that queries aggregated data designed to feed into a dashboard

A job that ingests raw data from a streaming source into the Lakehouse

Questions 9

A data engineer streams customer orders into a Kafka topic (orders_topic) and is currently writing the ingestion script of a DLT pipeline. The data engineer needs to ingest the data from Kafka brokers to DLT using Databricks.

What is the correct code for ingesting the data?

Options:

Option A

Option B

Option C

Option D

Questions 10

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

Options:

An automated report needs to be refreshed as quickly as possible.

An automated report needs to be made reproducible.

An automated report needs to be tested to identify errors.

An automated report needs to be version-controlled across multiple collaborators.

An automated report needs to be runnable by all stakeholders.

Questions 11

What Databricks feature can be used to check the data sources and tables used in a workspace?

Options:

Do not use the lineage feature as it only tracks activity from the last 3 months and will not provide full details on dependencies.

Use the lineage feature to visualize a graph that highlights where the table is used only in notebooks.

Use the lineage feature to visualize a graph that highlights where the table is used only in reports.

Use the lineage feature to visualize a graph that shows all dependencies, including where the table is used in notebooks, other tables, and reports.

Questions 12

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which of the following approaches can the data engineer take to identify the table that is dropping the records?

Options:

They can set up separate expectations for each table when developing their DLT pipeline.

They cannot determine which table is dropping the records.

They can set up DLT to notify them via email when records are dropped.

They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

Questions 13

A data engineer is using the OPTIMIZE command on a Delta table. What happens when OPTIMIZE is run twice on the same table with the same data?

Options:

It further reduces file sizes by re-clustering the data

Triggers a full liquid clustering process

Changes the number of tuples per file significantly

It has no effect because it is idempotent.
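
To make the idea of an idempotent compaction concrete, here is a toy plain-Python model of bin-packing small files toward a target size. It is only a sketch under stated assumptions: the function `compact`, the file sizes, and the target are hypothetical, and the greedy strategy is a stand-in rather than the actual algorithm used by OPTIMIZE.

```python
def compact(file_sizes, target):
    """Greedily merge files smaller than `target` into bins of at most `target`.

    Files already at or above the target are left untouched.
    """
    large = [size for size in file_sizes if size >= target]
    small = sorted(size for size in file_sizes if size < target)

    bins, current = [], 0
    for size in small:
        # Close the current bin when adding this file would overshoot the target.
        if current and current + size > target:
            bins.append(current)
            current = 0
        current += size
    if current:
        bins.append(current)
    return sorted(large + bins)


# Hypothetical file sizes in MB and a hypothetical 100 MB target.
files = [10, 20, 30, 200, 5]
once = compact(files, target=100)
twice = compact(once, target=100)
print(once, twice)
```

In this model the second pass finds nothing left to merge, so the layout after one run and after two runs is the same.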

Questions 14

A data engineer is working on a Databricks project that utilizes cloud storage. The data engineer wants to load several JSON files from containers on a storage account as soon as the files arrive within the storage account.

Which syntax should the data engineer follow to first load the files into a dataframe and check that it is working as expected using Python?

Options:

df = spark.readStream.format("json").load("input/path")

df = spark.readStream.format("cloud").option("json").load("/input/path")

df = spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("/input/path")

df = spark.read.json("input/path")

Questions 15

A data engineer is setting up access control in Unity Catalog and needs to ensure that a group of data analysts can query tables but not modify data.

Which permission should the data engineer grant to the data analysts?

Options:

SELECT

INSERT

MODIFY

ALL PRIVILEGES

Questions 16

An organization needs to share a dataset stored in its Databricks Unity Catalog with an external partner who uses a different data platform that is not Databricks. The goal is to maintain data security and ensure the partner can access the data efficiently.

Which method should the data engineer use to securely share the dataset with the external partner?

Options:

Using Delta Sharing with the open sharing protocol

Exporting data as CSV files and emailing them

Using a third-party API to access the Delta table

Databricks-to-Databricks Sharing

Questions 17

An organization is looking for an optimized storage layer that supports ACID transactions and schema enforcement. Which technology should the organization use?

Options:

Cloud File Storage

Unity Catalog

Data lake

Delta Lake

Questions 18

A data engineer wants to create an external table in Databricks that references data stored in an Azure Data Lake Storage (ADLS) location. The goal is to enable Databricks to access and query this external data without moving it into Databricks-managed storage.

Which step should the data engineer take to successfully create the external table?

Options:

Use the CREATE TABLE statement and specify the LOCATION clause with the path to the external data.

Use the CREATE UNMANAGED TABLE statement without specifying a LOCATION clause.

Use the CREATE EXTERNAL TABLE statement without specifying a LOCATION clause.

Use the CREATE MANAGED TABLE statement and specify the LOCATION clause with the path to the external data.

Questions 19

A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case.

Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?

Options:

They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.

They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.

They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.

There is no way to determine why a Job task is running slowly.

They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.

Questions 20

A data engineer has written a function in a Databricks Notebook to calculate the population of bacteria in a given medium.

Analysts use this function in the notebook and sometimes provide input arguments of the wrong data type, which can cause errors during execution.

Which Databricks feature will help the data engineer quickly identify if an incorrect data type has been provided as input?

Options:

The Data Engineer should add print statements to find out what the variable is.

The Databricks debugger enables breakpoints that will raise an error if the wrong data type is submitted

The Spark User interface has a debug tab that contains the variables that are used in this session.

The Databricks debugger enables the use of a variable explorer to see at a glance the value of the variables.

Questions 21

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

Options:

The table was managed

The table's data was smaller than 10 GB

The table's data was larger than 10 GB

The table was external

The table did not have a location

Questions 22

A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.

Which of the following keywords can be used to compact the small files?

Options:

REDUCE

OPTIMIZE

COMPACTION

REPARTITION

VACUUM

Questions 23

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

Options:

The STREAM function is not needed and will cause an error.

The table being created is a live table.

The customers table is a streaming live table.

The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.

The data in the customers table has been updated since its last run.

Questions 24

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which of the following approaches can the data engineer use to set up the new task?

Options:

They can clone the existing task in the existing Job and update it to run the new notebook.

They can create a new task in the existing Job and then add it as a dependency of the original task.

They can create a new task in the existing Job and then add the original task as a dependency of the new task.

They can create a new job from scratch and add both tasks to run concurrently.

They can clone the existing task to a new Job and then edit it to run the new notebook.
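
The direction of a task dependency is easy to get backwards, so here is a small plain-Python sketch of how a dependency graph determines run order. It uses the standard-library `graphlib` module rather than the Databricks Jobs API, and the task names `new_task` and `original_task` are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (hypothetical task names).
dependencies = {
    "original_task": {"new_task"},  # original_task waits for new_task to finish
    "new_task": set(),              # new_task has no upstream dependency
}

# A topological order lists every task after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

A task that must run first is the one that other tasks list as their dependency, not the other way around.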

Questions 25

A data engineer needs to ingest from both streaming and batch sources for a firm that relies on highly accurate data. Occasionally, some of the data picked up by the sensors that provide a streaming input are outside the expected parameters. If this occurs, the data must be dropped, but the stream should not fail.

Which feature of Delta Live Tables meets this requirement?

Options:

Monitoring

Change Data Capture

Expectations

Error Handling

Questions 26

A company uses Delta Sharing to collaborate with partners across different cloud providers and geographic regions. What will result in additional costs due to cross-region or egress fees?

Options:

Transferring data via Delta Sharing across clouds and across different geographic regions

Sharing data within the same cloud provider and region

Utilizing Delta Sharing for internal data analytics within a single cloud environment

Accessing Delta Sharing data using a VPN within the same data center

Questions 27

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group team.

Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

Options:

GRANT VIEW ON CATALOG customers TO team;

GRANT CREATE ON DATABASE customers TO team;

GRANT USAGE ON CATALOG team TO customers;

GRANT CREATE ON DATABASE team TO customers;

GRANT USAGE ON DATABASE customers TO team;

Questions 28

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

Options:

Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

Records that violate the expectation cause the job to fail.
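
As an illustration of drop-on-violation semantics, the following plain-Python sketch filters a batch against a predicate and counts what was removed. It is a model only, not the Delta Live Tables API: the helper `apply_expect_or_drop`, the sample records, and the counter (which loosely stands in for the quality metrics a pipeline would report) are all hypothetical.

```python
from datetime import datetime


def apply_expect_or_drop(records, predicate):
    """Keep records that satisfy `predicate`; report how many were dropped."""
    kept = [record for record in records if predicate(record)]
    dropped_count = len(records) - len(kept)
    return kept, dropped_count


# Hypothetical batch: one record violates timestamp > 2020-01-01.
batch = [
    {"id": 1, "timestamp": datetime(2021, 5, 1)},
    {"id": 2, "timestamp": datetime(2019, 12, 31)},
    {"id": 3, "timestamp": datetime(2022, 1, 15)},
]
cutoff = datetime(2020, 1, 1)

kept, dropped = apply_expect_or_drop(batch, lambda r: r["timestamp"] > cutoff)
print([r["id"] for r in kept], dropped)
```

Processing continues for the valid rows, and the violation is visible only as a count rather than as a failure.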

Questions 29

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

Options:

They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.

They can set up the dashboard’s SQL endpoint to be serverless.

They can turn on the Auto Stop feature for the SQL endpoint.

They can reduce the cluster size of the SQL endpoint.

They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.

Questions 30

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

Options:

Databricks Filesystem

Jobs

Dashboards

Repos

Data Explorer

Questions 31

A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.

Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?

Options:

It is not possible to use SQL in a Python notebook

They can attach the cell to a SQL endpoint rather than a Databricks cluster

They can simply write SQL syntax in the cell

They can add %sql to the first line of the cell

They can change the default language of the notebook to SQL

Questions 32

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

Options:

None of these changes will need to be made

The pipeline will need to stop using the medallion-based multi-hop architecture

The pipeline will need to be written entirely in SQL

The pipeline will need to use a batch source in place of a streaming source

The pipeline will need to be written entirely in Python

Questions 33

Which of the following Git operations must be performed outside of Databricks Repos?

Options:

Commit

Pull

Push

Clone

Merge

Questions 34

A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.

They run the following command:

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

None of these lines of code are needed to successfully complete the task

USING CSV

FROM CSV

USING DELTA

FROM "path/to/csv"

Questions 35

A Databricks single-task workflow fails at the last task due to an error in a notebook. The data engineer fixes the mistake in the notebook. What should the data engineer do to rerun the workflow?

Options:

Repair the task

Rerun the pipeline

Restart the Cluster

Switch the cluster

Questions 36

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.

Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

Options:

CREATE TABLE all_transactions AS SELECT * FROM march_transactions INNER JOIN SELECT * FROM april_transactions;

CREATE TABLE all_transactions AS SELECT * FROM march_transactions UNION SELECT * FROM april_transactions;

CREATE TABLE all_transactions AS SELECT * FROM march_transactions OUTER JOIN SELECT * FROM april_transactions;

CREATE TABLE all_transactions AS SELECT * FROM march_transactions INTERSECT SELECT * FROM april_transactions;

CREATE TABLE all_transactions AS SELECT * FROM march_transactions MERGE SELECT * FROM april_transactions;
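
To see how set operators combine two tables, the sketch below runs a UNION in an in-memory SQLite database from Python. SQLite is used here only as a stand-in for Spark SQL, and the column layout and sample rows are hypothetical. One row is deliberately duplicated across the two tables (unlike the scenario in the question) so that the difference between UNION and UNION ALL is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows; (2, 20.0) appears in both tables on purpose.
conn.executescript(
    """
    CREATE TABLE march_transactions (id INTEGER, amount REAL);
    CREATE TABLE april_transactions (id INTEGER, amount REAL);
    INSERT INTO march_transactions VALUES (1, 10.0), (2, 20.0);
    INSERT INTO april_transactions VALUES (3, 30.0), (2, 20.0);
    """
)

# UNION stacks the rows of both tables and removes exact duplicates.
conn.execute(
    "CREATE TABLE all_transactions AS "
    "SELECT * FROM march_transactions UNION SELECT * FROM april_transactions"
)
union_rows = conn.execute(
    "SELECT id, amount FROM all_transactions ORDER BY id"
).fetchall()

# UNION ALL keeps duplicates, so it returns every row from both inputs.
union_all_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    "SELECT * FROM march_transactions UNION ALL SELECT * FROM april_transactions)"
).fetchone()[0]

print(union_rows, union_all_count)
conn.close()
```

Joins, by contrast, combine columns from matching rows rather than stacking rows, which is why they do not fit a task that appends one month's records to another's.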

Questions 37

Identify a scenario to use an external table.

A Data Engineer needs to create a parquet bronze table and wants to ensure that it gets stored in a specific path in an external location.

Which table can be created in this scenario?

Options:

An external table where the location points to a specific path in an external location.

An external table where the schema has a managed location pointing to a specific path in an external location.

A managed table where the catalog has a managed location pointing to a specific path in an external location.

A managed table where the location points to a specific path in an external location.

Questions 38

A Python file is ready to go into production and the client wants to use the cheapest but most efficient type of cluster possible. The workload is quite small, processing only 10 GB of data with simple joins and no complex aggregations or wide transformations.

Which cluster meets the requirement?

Options:

Job cluster with Photon enabled

Interactive cluster

Job cluster with spot instances disabled

Job cluster with spot instances enabled

Questions 39

A data engineer works for an organization that must meet a stringent Service Level Agreement (SLA) that demands minimal runtime errors and high availability for its data processing pipelines. The data engineer wants to avoid the operational overhead of managing and tuning clusters.

Which architectural solution will meet the requirements?

Options:

Implement a hybrid approach with scheduled batch jobs on custom cloud VMs.

Use an auto-scaling cluster configured and monitored by the user.

Utilize Databricks serverless compute that automatically optimizes resources and abstracts cluster management.

Deploy a dedicated, manually managed cluster optimized by in-house IT staff.

Questions 40

Which of the following commands will return the location of database customer360?

Options:

DESCRIBE LOCATION customer360;

DROP DATABASE customer360;

DESCRIBE DATABASE customer360;

ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');

USE DATABASE customer360;

Questions 41

Which query is performing a streaming hop from raw data to a Bronze table?

Options:

Option A

Option B

Option C

Option D

Questions 42

Which of the following must be specified when creating a new Delta Live Tables pipeline?

Options:

A key-value pair configuration

The preferred DBU/hour cost

A path to cloud storage location for the written data

A location of a target database for the written data

At least one notebook library to be executed

Questions 43

A data engineer needs to process SQL queries on a large dataset with fluctuating workloads. The workload requires automatic scaling based on the volume of queries, without the need to manage or provision infrastructure. The solution should be cost-efficient and charge only for the compute resources used during query execution.

Which compute option should the data engineer use?

Options:

Databricks SQL Analytics

Databricks Jobs

Databricks Runtime for ML

Serverless SQL Warehouse

Questions 44

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.

Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

Options:

They can use endpoints available in Databricks SQL

They can use jobs clusters instead of all-purpose clusters

They can configure the clusters to be single-node

They can use clusters that are from a cluster pool

They can configure the clusters to autoscale for larger data sizes

Questions 45

A data engineer is processing ingested streaming tables and needs to filter out NULL values in the order_datetime column from the raw streaming table orders_raw and store the results in a new table orders_valid using DLT.

Which code snippet should the data engineer use?

Options:

Option A

Option B

Option C

Option D

Questions 46

Which file format is used for storing a Delta Lake table?

Options:

Parquet

Delta

JSON

Questions 47

A data engineer needs to optimize the data layout and query performance for an e-commerce transactions Delta table. The table is partitioned by "purchase_date", a date column, which helps with time-based queries but does not optimize searches on user statistics by "customer_id", a high-cardinality column.

The table is usually queried with filters on "customer_id" within specific date ranges, but since this data is spread across multiple files in each partition, it results in full partition scans and increased runtime and costs.

How should the data engineer optimize the Data Layout for efficient reads?

Options:

Alter the table, implementing liquid clustering on "customer_id" while keeping the existing partitioning.

Alter the table to partition by "customer_id".

Enable delta caching on the cluster so that frequent reads are cached for performance.

Alter the table, implementing liquid clustering by "customer_id" and "purchase_date".

Questions 48

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.

Which of the following tools can the data engineer use to solve this problem?

Options:

Unity Catalog

Data Explorer

Delta Lake

Delta Live Tables

Auto Loader

Questions 49

A data engineer at a company that uses Databricks with Unity Catalog needs to share a collection of tables with an external partner who also uses a Databricks workspace enabled for Unity Catalog. The data engineer decides to use Delta Sharing to accomplish this.

What is the first piece of information the data engineer should request from the external partner to set up Delta Sharing?

Options:

Their Databricks account password

The name of their Databricks cluster

The IP address of their Databricks workspace

The sharing identifier of their Unity Catalog metastore

Questions 50

A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.

Which of the following explains why the data files are no longer present?

Options:

The VACUUM command was run on the table

The TIME TRAVEL command was run on the table

The DELETE HISTORY command was run on the table

The OPTIMIZE command was run on the table

The HISTORY command was run on the table

Questions 51

A Databricks workflow fails at the last stage due to an error in a notebook. This workflow runs daily. The data engineer fixes the mistake and wants to rerun the pipeline. This workflow is very costly and time-intensive to run.

Which action should the data engineer take in order to minimise downtime and cost?

Options:

Switch to another cluster

Repair run

Re-run the entire workflow

Restart the cluster
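
As a rough illustration of why repairing a run can be cheaper than re-running everything, this plain-Python sketch selects only the tasks that did not succeed. The task names, status strings, and the helper `tasks_to_repair` are hypothetical, and this is a model rather than the Databricks Jobs API.

```python
def tasks_to_repair(statuses):
    """Return the tasks whose last status is anything other than SUCCESS."""
    return [name for name, status in statuses.items() if status != "SUCCESS"]


# Hypothetical outcome of a daily run where only the final task failed.
last_run = {
    "ingest": "SUCCESS",
    "transform": "SUCCESS",
    "publish": "FAILED",
}

to_rerun = tasks_to_repair(last_run)
print(to_rerun)
```

Here two of three tasks keep their earlier results, so only the failed final task consumes compute again.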

Questions 52

A data engineer is developing a small proof of concept in a notebook. When running the entire notebook, cluster usage spikes. The data engineer wants to keep the development experience and get real-time results.

Which cluster meets these requirements?

Options:

All-Purpose Cluster with a large fixed memory size

All-Purpose Cluster with autoscaling

Job Cluster with autoscaling enabled

Job Cluster with Photon enabled and autoscaling

Exam Code: Databricks-Certified-Data-Engineer-Associate

Exam Name: Databricks Certified Data Engineer Associate Exam

Last Update: Jul 14, 2026

Questions: 230
