

[15 points] k-Nearest-Neighbors


You sell IT products and are using kNN to build an IT wallet estimation predictor. You have information on the total IT budgets of a large set of companies, which will be your database of potential neighbors. You have already decided to use Euclidean distance. Now you want to estimate your wallet share for Acme Corp., one of your current customers for whom you do not know the IT budget. Explain precisely how you will estimate your wallet share for Acme with this technique, including (a) stating the target variable, (b) proposing 3 features for predicting the target variable, (c) the restriction on the choice of k and (d) evaluation.
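One way to make the procedure concrete is the minimal Python sketch below. It is not a model answer: it assumes the target variable is a company's total IT budget, that wallet share is your sales to Acme divided by the estimated budget, and that the three features are employee count, annual revenue and number of offices. The feature choice, the toy numbers and all function names are hypothetical and are there only to illustrate the mechanics. Features are standardized first because Euclidean distance is sensitive to scale, k is restricted to lie between 1 and the number of companies in the database, and evaluation uses leave-one-out mean absolute error on companies whose budgets are known.

```python
import math

def fit_scaler(rows):
    """Return a function that standardizes a row using the column means/stds of `rows`."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return lambda row: [(v - m) / s for v, m, s in zip(row, means, stds)]

def knn_predict(train_x, train_y, query, k):
    """Average the target over the k training rows closest to `query` (Euclidean)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of companies in the database")
    nearest = sorted(zip((math.dist(x, query) for x in train_x), train_y))[:k]
    return sum(y for _, y in nearest) / k

def estimate_budget(features, budgets, query, k):
    """Standardize with the database's statistics, then predict the budget by kNN."""
    scale = fit_scaler(features)
    return knn_predict([scale(f) for f in features], budgets, scale(query), k)

def leave_one_out_mae(features, budgets, k):
    """Evaluation: predict each known company from all the others, average the abs error."""
    errors = []
    for i in range(len(features)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = budgets[:i] + budgets[i + 1:]
        errors.append(abs(estimate_budget(rest_x, rest_y, features[i], k) - budgets[i]))
    return sum(errors) / len(errors)

# Made-up toy database: (employees, annual revenue in $M, offices) -> IT budget in $M.
FEATURES = [[120, 15.0, 2], [450, 60.0, 5], [80, 9.0, 1],
            [1000, 140.0, 12], [300, 40.0, 4], [700, 95.0, 8]]
BUDGETS = [0.6, 2.4, 0.3, 6.0, 1.5, 4.1]

ACME_FEATURES = [400, 50.0, 5]   # hypothetical feature values for Acme
OUR_SALES_TO_ACME = 0.5          # hypothetical: what Acme currently buys from us, $M

k = 3
acme_budget = estimate_budget(FEATURES, BUDGETS, ACME_FEATURES, k)
wallet_share = OUR_SALES_TO_ACME / acme_budget
print(f"Estimated IT budget: {acme_budget:.2f} $M, wallet share: {wallet_share:.1%}")
print(f"Leave-one-out MAE at k={k}: {leave_one_out_mae(FEATURES, BUDGETS, k):.2f} $M")
```

In practice k would be chosen by comparing the leave-one-out (or cross-validated) error across several candidate values rather than fixed in advance.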




HONG KONG BAPTIST UNIVERSITY
SEMESTER 1 EXAMINATION, 2020-2021
Course Code: ECON7880
Section Number: 1
Time Allowed: 3 Hour(s)
Course Title: Foundations in Big Data Analytics: Concepts and Techniques

Q1. [25 points] Confusion Matrix

Table 1 is a confusion matrix generated by Model A. The columns represent the actual positive and actual negative classes. The rows Y and N represent the predicted decisions “Yes (Offer)” and “No (No offer)” generated by Model A. There are 100,000 observations in total.

Table 1
        Actual positive    Actual negative
Y       56,000             6,000
N       5,000              33,000

Suppose that a correct positive prediction, i.e. predicting Y for an actual positive, yields $5, and an incorrect positive prediction, i.e. predicting Y for an actual negative, yields -$1. There is neither loss nor benefit for negative predictions, i.e. the benefit or cost of predicting N for either class is $0.

(a) Calculate the overall expected value for Model A per person. Show the calculation steps and state your answer.
(b) Write down the confusion matrix for the majority model. The majority model is either an “All-No model” if the negative class is the majority or an “All-Yes model” if the positive class is the majority.
(c) Calculate the overall expected value for the majority model per person. Show the calculation steps and state your answer.
(d) Assume the same number of Y offers as in Table 1. Write down the confusion matrix for the random model.
(e) Calculate the expected overall profit for the random model per person. Show the calculation steps and state your answer.

Q2. [15 points] k-Nearest-Neighbors

You sell IT products and are using kNN to build an IT wallet estimation predictor. You have information on the total IT budgets of a large set of companies, which will be your database of potential neighbors. You have already decided to use Euclidean distance. Now you want to estimate your wallet share for Acme Corp., one of your current customers for whom you do not know the IT budget.

Explain precisely how you will estimate your wallet share for Acme with this technique, including (a) stating the target variable, (b) proposing 3 features for predicting the target variable, (c) the restriction on the choice of k and (d) evaluation.

Q3. [25 points] Visualization Curves

A population of 100,000 customers with 2 types, responders (respond) and non-responders (not respond), has an unbalanced data structure with (responders, non-responders) = (25,000, 75,000). The overall response rate is 25%. A predictive model ranks the probability scores of the customers in descending order as shown in Table 2.

Table 2
Percentage of Targeted Customers    Cumulative Responses
10,000                              8,000
20,000                              14,000
30,000                              18,000
40,000                              19,000
50,000                              20,000
60,000                              21,000
70,000                              22,000
80,000                              23,000
90,000                              24,000
100,000                             25,000

(a) Plot the cumulative response curve at these 10 points with their corresponding (x, y) coordinates, the axis labels and the title of the plot.
(b) Plot the lift curve at these 10 points with their corresponding (x, y) coordinates, the axis labels and the title of the plot.
(c) The cost of each incentive offer is $12. The marketing campaign is subject to a budget constraint of $240,000. How many customers can the firm target?
(d) Suppose the revenue of each customer response is $40 and the cost of each incentive offer is $12. Write down the cost-benefit matrix.

        Respond    Not respond
Y
N

(e) Based on (d), calculate the expected profit per offer. Note: profit = revenue - cost.

Q4. [15 points] Naive Bayes

Table 3(a) shows the feature “Weather” and the Yes/No decision of 14 instances. The frequency table is summarized in Table 3(b).

Table 3(a)
Instance    Weather     Decision    Instance    Weather     Decision
#1          Sunny       No          #8          Rainy       No
#2          Overcast    Yes         #9          Sunny       Yes
#3          Rainy       Yes         #10         Rainy       Yes
#4          Sunny       Yes         #11         Sunny       No
#5          Sunny       Yes         #12         Overcast    Yes
#6          Overcast    Yes         #13         Overcast    Yes
#7          Rainy       No          #14         Rainy       No

Table 3(b) Frequency Table
Weather     No    Yes
Overcast          4
Rainy       3     2
Sunny       2     3
Total       5     9

(a) Use Table 3(b) to calculate the marginal likelihood for each type of weather, i.e. P(Overcast), P(Rainy) and P(Sunny).
(b) Use Table 3(b) to calculate the marginal likelihood for each decision, i.e. P(No) and P(Yes).
(c) Calculate the conditional probability P(Yes | ?????) using Bayes’ rule.

Q5. [20 points] Text Mining

Consider two documents, Document 1 and Document 2. Each of these contains the word “Hello” or “World” as shown in Table 4.

Table 4
                           Document 1    Document 2
Hello                      5             0
World                      1             2
Total words in document    75            100

Suppose there are 10,000 documents in the entire corpus, and the word “Hello” appears in 4,000 of these documents and “World” in 1,000 of these documents.

(a) Calculate the four normalized term frequencies (TF) for “Hello” and “World” in Document 1 and Document 2.
(b) Calculate the inverse document frequency (IDF) for these two words.
(c) Calculate the four TF-IDF values: (i) TF-IDF(“Hello”, Document 1), (ii) TF-IDF(“Hello”, Document 2), (iii) TF-IDF(“World”, Document 1), and (iv) TF-IDF(“World”, Document 2).
(d) Now we have a new query “Hello World”. Calculate its cosine similarity with Document 1 and Document 2 respectively. Which one is more similar to the search query?
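For the expected-value parts of Q1, the arithmetic can be sketched in Python as follows. This is an illustrative sketch rather than an official solution: the function and variable names are hypothetical, and the random model is interpreted here as assigning the same 62,000 Y offers independently of the actual class, so each cell is filled in proportion to the class base rates. Expected value per person is the sum over cells of (count × benefit) divided by the 100,000 observations.

```python
def expected_value_per_person(confusion, benefit):
    """Sum count * benefit over all (predicted, actual) cells, divided by total count."""
    total = sum(confusion.values())
    return sum(confusion[cell] * benefit[cell] for cell in confusion) / total

# Cells are keyed as (predicted decision, actual class); counts come from Table 1.
model_a = {("Y", "pos"): 56000, ("Y", "neg"): 6000,
           ("N", "pos"): 5000, ("N", "neg"): 33000}

# Benefits from the question: $5 for Y on a positive, -$1 for Y on a negative, $0 for N.
benefit = {("Y", "pos"): 5, ("Y", "neg"): -1, ("N", "pos"): 0, ("N", "neg"): 0}

total = sum(model_a.values())
actual_pos = model_a[("Y", "pos")] + model_a[("N", "pos")]
actual_neg = model_a[("Y", "neg")] + model_a[("N", "neg")]
n_yes = model_a[("Y", "pos")] + model_a[("Y", "neg")]
n_no = total - n_yes

# Majority model: predict the larger actual class for everyone.
if actual_pos >= actual_neg:
    majority = {("Y", "pos"): actual_pos, ("Y", "neg"): actual_neg,
                ("N", "pos"): 0, ("N", "neg"): 0}
else:
    majority = {("Y", "pos"): 0, ("Y", "neg"): 0,
                ("N", "pos"): actual_pos, ("N", "neg"): actual_neg}

# Random model (assumed interpretation): same number of Y offers as Model A,
# spread across the classes in proportion to their base rates.
random_model = {("Y", "pos"): n_yes * actual_pos / total,
                ("Y", "neg"): n_yes * actual_neg / total,
                ("N", "pos"): n_no * actual_pos / total,
                ("N", "neg"): n_no * actual_neg / total}

ev_model_a = expected_value_per_person(model_a, benefit)
ev_majority = expected_value_per_person(majority, benefit)
ev_random = expected_value_per_person(random_model, benefit)

for name, ev in [("Model A", ev_model_a), ("Majority", ev_majority), ("Random", ev_random)]:
    print(f"{name}: expected value per person = ${ev:.4f}")
```

Keeping the confusion matrix and the cost-benefit matrix as two dictionaries with the same keys mirrors the way the question separates counts from payoffs, so a different payoff scheme only requires changing `benefit`.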
Dec 11, 2021