Databricks-Certified-Professional-Data-Scientist Sample Questions Answers

Questions 4

Let A denote the event 'student is female' and let B denote the event 'student is French'. In a class of 100 students, suppose 60 are French and 10 of the French students are female. Find the probability that a randomly picked French student is female, that is, find P(A|B).

Options:

1/3

2/3

1/6

2/6
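The conditional probability in this question can be checked with a short calculation. The following is a minimal sketch, assuming every student is equally likely to be picked; the function name `conditional_probability` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def conditional_probability(count_a_and_b: int, count_b: int) -> Fraction:
    """Return P(A|B) = |A and B| / |B| when all outcomes are equally likely."""
    if count_b <= 0:
        raise ValueError("count_b must be positive, since P(B) must exceed 0")
    return Fraction(count_a_and_b, count_b)


french_students = 60   # |B|, from the question
french_females = 10    # |A and B|, from the question

p_female_given_french = conditional_probability(french_females, french_students)
print(p_female_given_french)  # 1/6
```

Note that the class size of 100 cancels out: only the counts inside the conditioning event B matter.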

Questions 5

A user opens a website 3 times. The probability that the user clicks the advertisement 2 times is best calculated by which distribution?

Options:

Binomial

Poisson

Normal

Any of the above

Questions 6

Which of the following is a correct example of the target variable in regression (supervised learning)?

Options:

Nominal values like true, false

Reptile, fish, mammal, amphibian, plant, fungi

Infinite number of numeric values, such as 0.100, 42.001, 1000.743..

All of the above

Questions 7

Which of the following steps will you use in the discovery phase?

Options:

What are all the data sources for the project?

Analyze the raw data and its format and structure.

What tools are required in the project?

What network capacity is required?

What Unix server capacity is required?

Questions 8

Which of the following best describes principal component analysis?

Options:

Dimensionality reduction

Collaborative filtering

Classification

Regression

Clustering

Questions 9

You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?

Options:

Identify additional measures to add to the analysis

Remove one of the measures

Decrease the number of clusters

Increase the number of clusters

Questions 10

RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a ______, as it is scale-dependent.

Options:

Between Variables

Particular Variable

Among all the variables

All of the above are correct

Questions 11

A problem statement is given below.

Hospital records show that of patients suffering from a certain disease, 75% die of it. What is the probability that of 6 randomly selected patients, 4 will recover?

Which of the following models will you use to solve it?

Options:

Binomial

Poisson

Normal

Any of the above
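The hospital problem has a fixed number of independent trials, each with two outcomes, so the arithmetic can be sketched with the binomial probability mass function. This is a minimal illustration, assuming patients recover independently and that "recover" is the complement of "die" (so the recovery probability is 25%); the helper name `binomial_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k: int, n: int, p: Fraction) -> Fraction:
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_die = Fraction(3, 4)     # 75% die, from the question
p_recover = 1 - p_die      # assumed complement: 25% recover

prob_four_recover = binomial_pmf(4, 6, p_recover)
print(prob_four_recover, float(prob_four_recover))  # 135/4096, roughly 0.033
```

Using `Fraction` keeps the result exact; the same function accepts a float for `p` if an approximate answer is enough.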

Questions 12

In which of the following scenarios should you apply Bayes' theorem?

Options:

The sample space is partitioned into a set of mutually exclusive events {A1, A2, ..., An}.

Within the sample space, there exists an event B, for which P(B) > 0.

The analytical goal is to compute a conditional probability of the form: P(Ak | B).

In all of the above cases

Questions 13

Refer to the Exhibit.

In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data?

Options:

Tree A

Tree B

Tree C

Tree D

Questions 14

You are designing a recommendation engine for a website. It generates more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to have similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user. What kind of recommendation engine is this?

Options:

Naive Bayes classifier

Collaborative filtering

Logistic Regression

Content-based filtering

Questions 15

While working with Netflix, the movie rating website, you have developed a recommender system whose rating predictions are consistently exactly 1 higher than the ratings given in the dataset for every user-item pair. There are n items in the dataset. What will be the calculated RMSE of your recommender system on the dataset?

Options:

n/2
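The effect of a constant offset on RMSE can be checked numerically. The sketch below uses a small set of made-up ratings (hypothetical data, not from any real dataset) and a hypothetical helper `rmse`; it only illustrates how the metric behaves when every prediction is off by exactly 1.

```python
from math import sqrt


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root mean squared error between two equally long, non-empty sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings for illustration only.
ratings = [1, 2, 3, 4, 5, 3, 2]
predictions = [r + 1 for r in ratings]   # every prediction is exactly 1 too high

print(rmse(ratings, predictions))  # 1.0
```

Every squared error equals 1, so the mean is 1 and its square root is 1 no matter how many ratings the list contains.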

Questions 16

Which of the following is not a correct application of classification?

Options:

credit scoring

tumor detection

image recognition

drug discovery

Questions 17

What is the best way to evaluate the quality of the model found by an unsupervised algorithm like k-means clustering, given metrics for the cost of the clustering (how well it fits the data) and its stability (how similar the clusters are across multiple runs over the same data)?

Options:

The lowest cost clustering subject to a stability constraint

The lowest cost clustering

The most stable clustering subject to a minimal cost constraint

The most stable clustering

Questions 18

Select the correct objectives of principal component analysis

Options:

To reduce the dimensionality of the data set

To identify new meaningful underlying variables

To discover the dimensionality of the data set

Only 1 and 2

All 1, 2 and 3

Questions 19

The method based on principal component analysis (PCA) evaluates the features according to

Options:

The projection of the largest eigenvector of the correlation matrix on the initial dimensions

According to the magnitude of the components of the discriminant vector

The projection of the smallest eigenvector of the correlation matrix on the initial dimensions

None of the above

Questions 20

A data scientist is asked to implement an article recommendation feature for an on-line magazine.

The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article is available for making recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics.

Which method should the data scientist try first?

Options:

K Means Clustering

Naive Bayesian

Logistic Regression

Association Rules

Questions 21

What type of output is generated by linear regression?

Options:

Continuous variable

Discrete Variable

Any of the Continuous and Discrete variable

Values between 0 and 1

Questions 22

Which of the following metrics are useful in measuring the accuracy and quality of a recommender system?

Options:

Cluster Density

Support Vector Count

Mean Absolute Error

Sum of Absolute Errors

Questions 23

Regularization is a very important technique in machine learning to prevent overfitting. Mathematically speaking, it adds a regularization term to keep the coefficients from fitting the training data so closely that the model overfits. The difference between L1 and L2 is...

Options:

L2 is the sum of the square of the weights, while L1 is just the sum of the weights

L1 is the sum of the square of the weights, while L2 is just the sum of the weights

L1 gives Non-sparse output while L2 gives sparse outputs

None of the above
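The two penalty terms can be written out directly. This is a minimal sketch with hypothetical function names and a made-up weight vector; it uses the standard definitions, in which the L1 penalty sums the absolute values of the weights (the options above abbreviate this as "the sum of the weights") and the L2 penalty sums their squares.

```python
def l1_penalty(weights: list[float]) -> float:
    """L1 regularization term: sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights: list[float]) -> float:
    """L2 regularization term: sum of the squared weights."""
    return sum(w * w for w in weights)


# Hypothetical weight vector for illustration only.
weights = [0.5, -2.0, 0.0, 3.0]

print(l1_penalty(weights))  # 5.5
print(l2_penalty(weights))  # 13.25
```

In practice each penalty is multiplied by a regularization strength before being added to the loss; that factor is omitted here to keep the comparison between the two terms visible.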

Questions 24

Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:

In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%.

Select the correct statement

Options:

Precision is low, which means the classifier is predicting positives best

Precision is low, which means the classifier is predicting positives poorly

The problem domain has a major impact on the measures that should be used to evaluate a classifier within it

1 and 3

2 and 3
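The four measures quoted in the question follow from the four cells of a confusion matrix. The matrix itself is not reproduced in the text, so the counts below are an assumption: one set of values that matches 600 positives out of 11,100 instances and reproduces the stated percentages after rounding. The function name `classifier_metrics` is hypothetical.

```python
def classifier_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute precision, recall, specificity and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),            # of predicted positives, share that are real
        "recall": tp / (tp + fn),               # of real positives, share that are found
        "specificity": tn / (tn + fp),          # of real negatives, share that are found
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Assumed counts (not given in the text): 500 + 100 = 600 positives,
# 500 + 10,000 = 10,500 negatives, 11,100 instances in total.
metrics = classifier_metrics(tp=500, fp=500, fn=100, tn=10_000)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these counts, half of the predicted positives are false alarms even though accuracy is high, which is typical when positives are rare.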

Questions 25

Select the correct statement which applies to K-Nearest Neighbors

Options:

No Assumption about the data

Computationally expensive

Require less memory

Works with Numeric Values

Questions 26

Suppose you have built a model for a rating system that rates from 1 to 5 stars, and you calculated that the RMSE is 1.0. Which of the following is correct?

Options:

It means that your predictions are on average one star off of what people really think

It means that your predictions are on average two stars off of what people really think

It means that your predictions are on average three stars off of what people really think

It means that your predictions are on average four stars off of what people really think

Questions 27

You are working on a classification model for a book written by HadoopExam Learning Resources and have decided to build a text classification model for determining whether this book is about Hadoop or cloud computing. To cut down on the size of the feature space, you will use the mutual information of each word with the label (Hadoop or Cloud) to select the 1000 best features to use as input to a Naive Bayes model. When you compare the performance of a model built with the 250 best features to a model built with the 1000 best features, you notice that the model with only 250 features performs slightly better on the test data.

What would help you choose better features for your model?

Options:

Include least mutual information with other selected features as a feature selection criterion

Include the number of times each of the words appears in the book in your model

Decrease the size of our training data

Evaluate a model that only includes the top 100 words

Questions 28

Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several variables that may be......

Options:

Numerical

Categorical

Both 1 and 2 are correct

Neither 1 nor 2 is correct

Questions 29

In which of the following scenarios can you use the linear regression model?

Options:

Predicting home price based on the location and house area

Predicting demand for goods and services based on the weather

Predicting tumor size reduction based on the number of radiation treatments as input

Predicting sales of a textbook based on the number of students in the state

Questions 30

If you are trying to predict or forecast a discrete target value, which is the correct option?

Options:

Supervised Learning regression algorithms

Supervised Learning classification algorithms

Unsupervised Learning

Density estimation algorithm

Questions 31

You are creating a regression model with the income, education, and current debt of a customer as inputs. What could be a possible output from this model?

Options:

Customer fit as good

Customer fit as acceptable or average category

Probability, expressed as a percent, that the customer will default on a loan

1 and 3 are correct

2 and 3 are correct

Questions 32

You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most appropriate for this study?

Options:

Association rules

Decision trees

Linear regression

K-means clustering

Questions 33

Scenario: Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. Which of the following methods will the boss use to estimate the probability that Bob drove to work?

Options:

Naive Bayes

Linear regression

Random decision forests

None of the above
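The boss's estimate in this scenario is a posterior probability computed from the priors and the per-mode lateness rates. The following is a minimal sketch of that Bayes' rule calculation using only the numbers in the scenario; the variable names are hypothetical.

```python
from fractions import Fraction

# Prior probability of each mode of transportation (1/3 each, from the scenario).
priors = {mode: Fraction(1, 3) for mode in ("car", "bus", "train")}

# P(late | mode), from the scenario: 50%, 20% and 1%.
late_given_mode = {
    "car": Fraction(1, 2),
    "bus": Fraction(1, 5),
    "train": Fraction(1, 100),
}

# Total probability of being late, summed over the three mutually exclusive modes.
p_late = sum(priors[m] * late_given_mode[m] for m in priors)

# Bayes' rule: P(mode | late) = P(late | mode) * P(mode) / P(late).
posterior = {m: priors[m] * late_given_mode[m] / p_late for m in priors}

print(posterior["car"], float(posterior["car"]))  # 50/71, roughly 0.704
```

Because the priors are equal, they cancel, and the posterior for each mode is just its lateness rate divided by the sum of the three rates.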

Questions 34

Select the correct example of an unsupervised algorithm

Options:

K-Nearest Neighbors

K-Means

Support Vector Machines

Naive Bayes

Questions 35

Select the correct statement which applies to logistic regression

Options:

Computationally inexpensive, easy to implement, knowledge representation easy to interpret

May have low accuracy

Works with Numeric values

Only 1 and 3 are correct

All 1, 2 and 3 are correct

Questions 36

A bio-scientist is working on the analysis of cancer cells. To identify whether a cell is cancerous or not, hundreds of tests with small variations have been run to answer that question. Given the test results for a sample of healthy and cancerous cells, which of the following techniques will you use to determine whether a cell is healthy?

Options:

Linear regression

Collaborative filtering

Naive Bayes

Identification Test

Questions 37

Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether or not the candidate is an incumbent.

Above is an example of

Options:

Linear Regression

Logistic Regression

Recommendation system

Maximum likelihood estimation

Hierarchical linear models

Questions 38

In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

Options:

Discovery

Data Preparation

Model Building

Communicate Results

Questions 39

What describes a true limitation of Logistic Regression method?

Options:

It does not handle redundant variables well.

It does not handle missing values well.

It does not handle correlated variables well.

It does not have explanatory values.

Questions 40

You have data on 1000 patients, with height and age, where age is in years and height is in meters. You want to create clusters using these two attributes, and you want age and height to have nearly equal effect when creating the clusters. What can you do?

Options:

You will be adding height with the numeric value 100

You will be converting each height value to centimeters

You will be dividing both age and height with their respective standard deviation

You will be taking square root of height
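The scale mismatch behind this question can be seen by dividing each attribute by its own standard deviation. The sketch below uses a handful of made-up patient values (hypothetical data) and a hypothetical helper `scale_by_std`; it is an illustration of that one rescaling step, not a full clustering pipeline.

```python
from statistics import pstdev


def scale_by_std(values: list[float]) -> list[float]:
    """Divide every value by the population standard deviation of the list."""
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant attribute")
    return [v / sd for v in values]


# Hypothetical patient data for illustration only.
ages_years = [25.0, 40.0, 60.0, 75.0]
heights_m = [1.60, 1.70, 1.80, 1.90]

scaled_ages = scale_by_std(ages_years)
scaled_heights = scale_by_std(heights_m)

# Before scaling, age varies by tens of units and height by tenths of a unit,
# so distance-based clustering would be driven almost entirely by age.
print(pstdev(ages_years), pstdev(heights_m))
# After scaling, both attributes have a standard deviation of about 1.
print(pstdev(scaled_ages), pstdev(scaled_heights))
```

Many workflows also subtract the mean before dividing (z-scoring); for distance-based clustering the shift does not change distances between points, so only the division is shown here.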

Questions 41

The feature hashing approach is described as: "SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and shoehorning the training data into vectors of that size." With large vectors or with multiple locations per feature, feature hashing:

Options:

Is a problem with accuracy

It is hard to understand what classifier is doing

It is easy to understand what classifier is doing

Is a problem with accuracy as well as hard to understand what classifier is doing

Exam Code: Databricks-Certified-Professional-Data-Scientist

Exam Name: Databricks Certified Professional Data Scientist Exam

Last Update: Jul 14, 2026

Questions: 138
