

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Sample Questions Answers

Questions 4

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed

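For orientation, a minimal sketch of enabling dynamic allocation and raising the shuffle partition count at session start; the app name and values are illustrative and not taken from the question.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("utilization-demo")                                        # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")                  # let Spark scale executors up and down
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed when no external shuffle service is available
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.sql.shuffle.partitions", "400")                      # more shuffle partitions -> more tasks per stage
    .getOrCreate()
)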
Questions 5

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:

Low number of Active Tasks

Many tasks complete in milliseconds

Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?

Options:

A.

Set the number of partitions equal to the total number of CPUs in the cluster

B.

Set the number of partitions to a fixed value, such as 200

C.

Set the number of partitions equal to the number of nodes in the cluster

D.

Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB

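For reference, the sizing arithmetic described in option D can be sketched directly; the path and the 128 MB target are illustrative, and an active SparkSession named spark is assumed.

dataset_size_mb = 1 * 1024 * 1024                          # 1 TB expressed in MB
target_partition_mb = 128                                  # a common target partition size
num_partitions = dataset_size_mb // target_partition_mb    # = 8192 partitions

df = spark.read.parquet("/data/large_dataset")             # hypothetical input path
df = df.repartition(num_partitions)                        # roughly one ~128 MB partition per task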
Questions 6

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")

Questions 7

A data engineer is streaming data from Kafka and requires:

Minimal latency

Exactly-once processing guarantees

Which trigger mode should be used?

Options:

A.

.trigger(processingTime='1 second')

B.

.trigger(continuous=True)

C.

.trigger(continuous='1 second')

D.

.trigger(availableNow=True)

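As a quick reference, a sketch of the Structured Streaming trigger API surface, assuming df is an existing streaming DataFrame; the sink and checkpoint path are illustrative, and which mode satisfies the stated guarantees is left to the question.

query = (
    df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/ckpt")    # hypothetical checkpoint path
    .trigger(processingTime="1 second")           # micro-batch fired every second
    # .trigger(continuous="1 second")             # continuous processing with a 1 s checkpoint interval
    # .trigger(availableNow=True)                 # process everything available, then stop
    .start()
)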
Questions 8

Given the code fragment:

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

A.

psdf.to_spark()

B.

psdf.to_pyspark()

C.

psdf.to_pandas()

D.

psdf.to_dataframe()

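A self-contained sketch of the conversions involved (Spark 3.2+), extending the code fragment from the question:

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
sdf = psdf.to_spark()          # pyspark.sql.DataFrame
pdf = psdf.to_pandas()         # local pandas.DataFrame, collected to the driver
psdf_back = sdf.pandas_api()   # back to a pyspark.pandas.DataFrame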
Questions 9

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

Options:

A.

A Cartesian join

B.

A shuffled hash join

C.

A broadcast nested loop join

D.

A sort-merge join

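For context, the AQE settings referenced in the question can be sketched as below; the values are illustrative, and the Spark documentation suggests keeping the threshold at least as large as the advisory partition size.

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")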
Questions 10

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Options:

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

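A self-contained sketch of grouped-map execution with applyInPandas on toy data; the column names follow the question, the sample rows are made up.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 30.0)], ["user_id", "value"])

def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # each group arrives as a pandas DataFrame on an executor
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "value": [pdf["value"].mean()]})

df.groupBy("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()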
Questions 11

What is the difference between df.cache() and df.persist() for a Spark DataFrame?

Options:

A.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

B.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

C.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER), and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

D.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK), and persist() - Can be used to set different storage levels to persist the contents of the DataFrame.

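A short sketch of the two calls side by side, assuming df is an existing DataFrame; the explicit StorageLevel choice is illustrative.

from pyspark import StorageLevel

df.cache()                           # persist() with the default storage level for DataFrames
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)   # persist() additionally accepts an explicit storage level
df.unpersist()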
Questions 12

An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the command spark-submit with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Questions 13

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

A.

customerDF.select(

col("email").substr(0, 5).alias("username"),

col("email").substr(-5).alias("domain")

)

B.

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \

.withColumn("domain", split(col("email"), "@").getItem(1))

C.

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \

.withColumn("domain", substring_index(col("email"), "@", -1))

D.

customerDF.select(

regexp_replace(col("email"), "@", "").alias("username"),

regexp_replace(col("email"), "@", "").alias("domain")

)

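A quick check of the split-based approach on a toy row; the sample data is made up, and an active SparkSession named spark is assumed.

from pyspark.sql.functions import col, split

customerDF = spark.createDataFrame([("alice@example.com",)], ["email"])
result = (customerDF
          .withColumn("username", split(col("email"), "@").getItem(0))
          .withColumn("domain", split(col("email"), "@").getItem(1)))
result.show()   # username = 'alice', domain = 'example.com'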
Questions 14

A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?

Options:

A.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')

B.

fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))

C.

fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))

D.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')

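For orientation, a left outer join on toy data keeps the unmatched purchase row; the sample rows are made up, the column names follow the question, and an active SparkSession named spark is assumed.

from pyspark.sql import functions as F

purch_df = spark.createDataFrame([(1, 100.0), (2, 50.0)], ["customer_id", "amount"])
cust_df = spark.createDataFrame([(1, "Alice")], ["custid", "name"])   # customer 2 is missing here

fact_df = purch_df.join(cust_df, F.col("customer_id") == F.col("custid"), "left")
fact_df.show()   # customer_id 2 is retained, with null custid and name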
Questions 15

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.

Questions 16

Given the following code snippet in my_spark_app.py:

What is the role of the driver node?

Options:

A.

The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes

B.

The driver node only provides the user interface for monitoring the application

C.

The driver node holds the DataFrame data and performs all computations locally

D.

The driver node stores the final result after computations are completed by worker nodes

Questions 17

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

Options:

A.

regions = dict(

regions_df

.select('region', 'region_id')

.sort('region_id')

.take(3)

)

B.

regions = dict(

regions_df

.select('region_id', 'region')

.sort('region_id')

.take(3)

)

C.

regions = dict(

regions_df

.select('region_id', 'region')

.limit(3)

.collect()

)

D.

regions = dict(

regions_df

.select('region', 'region_id')

.sort(desc('region_id'))

.take(3)

)

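To see how dict() treats a list of two-column Rows, here is a toy run; the region names are made-up stand-ins for the Parquet table, and an active SparkSession named spark is assumed.

regions_df = spark.createDataFrame(
    [(3, "EUROPE"), (0, "AFRICA"), (2, "ASIA"), (1, "AMERICA")],
    ["region_id", "region"])

rows = (regions_df
        .select("region", "region_id")   # first selected column becomes the key, second the value
        .sort("region_id")
        .take(3))
regions = dict(rows)                     # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}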
Questions 18

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

A.

Optimize the data processing logic by repartitioning the DataFrame.

B.

Modify the Spark configuration to disable garbage collection

C.

Increase the memory allocated to the Spark Driver.

D.

Cache large DataFrames to persist them in memory.

Questions 19

A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.

Which operation is AQE implementing to improve performance?

Options:

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk

Questions 20

A data scientist has identified that some records in the user profile table contain null values in any of the fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.

The schema of the user profile table looks like this:

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)

B.

filtered_df = users_raw_df.na.drop(how='all')

C.

filtered_df = users_raw_df.na.drop(how='any')

D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)

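The behaviour of na.drop can be checked on a toy frame; the sample rows are made up, and an active SparkSession named spark is assumed.

users_raw_df = spark.createDataFrame(
    [(1, "alice", "1990-01-01"), (2, None, "1985-05-05"), (None, None, None)],
    ["user_id", "username", "date_of_birth"])

users_raw_df.na.drop(how="any").show()   # drops every row containing at least one null
users_raw_df.na.drop(how="all").show()   # drops only rows in which all columns are null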
Questions 21

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment selects only the columns col1 and col2 during the reading process?

Options:

A.

spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")

B.

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")

C.

spark.read.orc("/file/test_data.orc").selected("col1", "col2")

D.

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Questions 22

A Data Analyst is working on the DataFrame sensor_df, which contains two columns:

Which code fragment returns a DataFrame that splits the record column into separate columns and has one array item per row?

Options:

A.

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

exploded_df = exploded_df.select("record_datetime", "sensor_id", "status", "health")

B.

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

exploded_df = exploded_df.select(
    "record_datetime",
    "record_exploded.sensor_id",
    "record_exploded.status",
    "record_exploded.health"
)

C.

exploded_df = exploded_df.select(

"record_datetime",

"record_exploded.sensor_id",

"record_exploded.status",

"record_exploded.health"

)

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

D.

exploded_df = exploded_df.select("record_datetime", "record_exploded")

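A self-contained sketch of exploding an array-of-structs column and then flattening the struct fields; the toy schema only approximates the question's record column, and an active SparkSession named spark is assumed.

from pyspark.sql import Row
from pyspark.sql.functions import explode

sensor_df = spark.createDataFrame([
    Row(record_datetime="2024-01-01 00:00:00",
        record=[Row(sensor_id=1, status="OK", health=0.9),
                Row(sensor_id=2, status="WARN", health=0.4)])
])

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))
exploded_df = exploded_df.select(
    "record_datetime",
    "record_exploded.sensor_id",
    "record_exploded.status",
    "record_exploded.health")
exploded_df.show()   # one row per array element, struct fields as separate columns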
Questions 23

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

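For orientation, a sketch contrasting a wide (shuffle) transformation with narrow follow-up transformations on toy data; the sample rows are made up, and an active SparkSession named spark is assumed.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 10, 120.0), (1, 11, 30.0), (2, 12, 80.0)],
    ["user_id", "product_id", "purchase_amount"])

totals = df.groupBy("user_id").agg(F.sum("purchase_amount").alias("total_purchase"))  # wide: requires a shuffle
result = (totals
          .withColumn("discount", F.col("total_purchase") * 0.1)   # narrow: no shuffle
          .where("total_purchase > 50"))                           # narrow: no shuffle
result.show()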
Questions 24

A data scientist wants each record in the DataFrame to contain:

The entire contents of a file

The full file path

The first attempt at the code does read the text files, but each record contains a single line rather than the whole text of a file. This code is shown below:

corpus = spark.read.text("/datasets/raw_txt/*") \
    .select('*', '_metadata.file_path')

Which change will ensure one record per file?

Options:

A.

Add the option wholetext=True to the text() function

B.

Add the option lineSep='\n' to the text() function

C.

Add the option wholetext=False to the text() function

D.

Add the option lineSep=", " to the text() function

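For reference, the text reader's wholetext parameter produces one record per file; a minimal sketch, using the path from the question and assuming an active SparkSession named spark.

corpus = (spark.read
          .text("/datasets/raw_txt/*", wholetext=True)   # one record per file instead of one per line
          .select("*", "_metadata.file_path"))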
Questions 25

The following code fragment results in an error:

@F.udf(T.IntegerType())
def simple_udf(t: str) -> str:
    return answer * 3.14159

Which code fragment should be used instead?

Options:

A.

@F.udf(T.IntegerType())
def simple_udf(t: int) -> int:
    return t * 3.14159

B.

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

C.

@F.udf(T.DoubleType())
def simple_udf(t: int) -> int:
    return t * 3.14159

D.

@F.udf(T.IntegerType())
def simple_udf(t: float) -> float:
    return t * 3.14159

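For reference, one working variant in which the declared return type matches the value actually returned; the toy input data is made up, and an active SparkSession named spark is assumed.

from pyspark.sql import functions as F, types as T

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

df = spark.createDataFrame([(1.0,), (2.0,)], ["t"])
df.select(simple_udf("t").alias("scaled")).show()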