
Please allocate this only to an expert with excellent domain knowledge.


Overview

It is well-known that missing values are one of the biggest challenges in data science projects. You might know that k-nearest-neighbour-based Collaborative Filtering is also called "memory-based" Collaborative Filtering. Luckily, data scientists and researchers have been working hard on the missing value problem in k-neighbourhood-based Collaborative Filtering, and solutions exist.

In this assignment, you are required to tackle the missing value problem in Collaborative Filtering by predicting the missing values. Specifically, an existing solution for predicting missing values in Collaborative Filtering is provided: a report named "Effective Missing Data Prediction for Collaborative Filtering". Please read this report carefully, then complete the following tasks.

Task 1: Implementation

In this task, you are required to implement the solution in the provided report so as to predict the missing values in Collaborative Filtering. Note that you must write your own implementation; do not use any libraries related to Recommender Systems or Collaborative Filtering. If you use any such library, your implementation part will be invalid.

We provide Python framework code (named assignment3_framework.ipynb) to help you get started; it will also automate the correctness marking. The framework also includes the training data and the test data. Please put your own code only in the provided cell of the framework, as shown in Figure 1, and DO NOT CHANGE anything in the other cells, otherwise you might cause errors during the automatic marking.

Please provide detailed comments to explain your implementation. For the expected level of detail, take the comments in the ipynb files from Week 10 (knn_based_cf_updated.zip) as examples. You might find the following information useful: https://www.w3schools.com/python/python_comments.asp

Task 2: Presentation

• The presentation should:
  - explain, clearly and completely in your own words, how the solution in the provided report predicts the missing values in Collaborative Filtering;
  - explain, clearly and completely, why the solution in the provided report can tackle the missing value problem in Collaborative Filtering;
  - explain, clearly and completely, how you implemented the solution.
• The presentation should be no more than 10 minutes.
• Your presentation slides should be:
  - Microsoft PowerPoint slides (with audio inserted for each slide by using Insert -> Audio -> Record Audio);
  - or your own presentation slides (e.g. a PDF version), in which case please also submit your own recording of the presentation (in mp4 or avi format).

Note:
1. Main menu -> Kernel -> Restart & Run All.
2. Wait till you see the output displayed properly. You should see all the data printed and the graphs displayed.

Effective Missing Data Prediction for Collaborative Filtering
Hao Ma, Irwin King and Michael R. Lyu
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
{hma, king, lyu}@cse.cuhk.edu.hk
SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands.
ABSTRACT

Memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality. Usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. This paper focuses on two crucial factors of memory-based collaborative filtering: (1) similarity computation between users or items and (2) missing data prediction algorithms. First, we enhance the Pearson Correlation Coefficient (PCC) algorithm by adding one parameter, which overcomes the potential decrease of accuracy when computing the similarity of users or items. Second, we propose an effective missing data prediction algorithm in which information about both users and items is taken into account. In this algorithm, we set similarity thresholds for users and items respectively, and the prediction algorithm determines whether or not to predict the missing data. We also address how to predict the missing data by employing a combination of user and item information. Finally, empirical studies on the MovieLens dataset have shown that our newly proposed method outperforms other state-of-the-art collaborative filtering algorithms and is more robust against data sparsity.

Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval - Information Filtering
General Terms: Algorithm, Performance, Experimentation.
Keywords: Collaborative Filtering, Recommender System, Data Prediction, Data Sparsity.

1. INTRODUCTION

Collaborative filtering is the method which automatically predicts the interest of an active user by collecting rating information from other similar users or items; related techniques have been widely employed in large, famous commercial systems such as Amazon (http://www.amazon.com/) and Ebay (http://www.half.ebay.com/). The underlying assumption of collaborative filtering is that the active user will prefer those items which similar users prefer. The research of collaborative filtering started from memory-based approaches, which utilize the entire user-item database to generate a prediction based on user or item similarity. Two types of memory-based methods have been studied: user-based [2, 7, 10, 22] and item-based [5, 12, 17]. User-based methods first look for similar users who have rating styles similar to the active user's, and then employ the ratings from those similar users to predict ratings for the active user. Item-based methods share the same idea as user-based methods; the only difference is that user-based methods find similar users for an active user, while item-based methods find similar items for each item. Whether in user-based or item-based approaches, the computation of similarity between users or items is a very critical step.
Notable similarity computation algorithms include the Pearson Correlation Coefficient (PCC) [16] and the Vector Space Similarity (VSS) algorithm [2].

Although memory-based approaches have been widely used in recommendation systems [12, 16], the problem of inaccurate recommendation results still exists in both user-based and item-based approaches. The fundamental problem of memory-based approaches is the data sparsity of the user-item matrix. Many recent algorithms have been proposed to alleviate the data sparsity problem. In [21], Wang et al. proposed a generative probabilistic framework to exploit more of the data available in the user-item matrix by fusing all ratings with a predictive value for a recommendation to be made. Xue et al. [22] proposed a framework for collaborative filtering which combines the strengths of memory-based and model-based approaches by introducing a smoothing-based method, and solved the data sparsity problem by predicting all the missing data in a user-item matrix. Although the simulation showed that this approach can achieve better performance than other collaborative filtering algorithms, the cluster-based smoothing algorithm limited the diversity of users in each cluster, and predicting all the missing data in the user-item matrix could bring negative influence to the recommendations for active users.

In this paper, we first use PCC-based significance weighting to compute similarity between users and items, which overcomes the potential decrease of similarity accuracy. Second, we propose an effective missing data prediction algorithm which exploits information from both users and items. Moreover, this algorithm will predict the missing data of a user-item matrix if and only if we think it will bring positive influence for the recommendation of active users, instead of predicting every missing entry of the user-item matrix. The simulation shows our novel approach achieves better performance than other state-of-the-art collaborative filtering approaches.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of several major approaches for collaborative filtering. Section 3 shows the method of similarity computation. The framework of our missing data prediction and collaborative filtering is introduced in Section 4. The results of an empirical analysis are presented in Section 5, followed by a conclusion in Section 6.

2. RELATED WORK

In this section, we review several major approaches for collaborative filtering. Two types of collaborative filtering approaches are widely studied: memory-based and model-based.

2.1 Memory-based Approaches

The memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, 16]. The most analyzed examples of memory-based collaborative filtering include user-based approaches [2, 7, 10, 22] and item-based approaches [5, 12, 17]. User-based approaches predict the ratings of active users based on the ratings of similar users found, and item-based approaches predict the ratings of active users based on the information of similar items computed. User-based and item-based approaches often use the PCC algorithm [16] and the VSS algorithm [2] as the similarity computation methods.
PCC-based collaborative filtering generally can achieve higher performance than the other popular algorithm, VSS, since it considers the differences in user rating styles.

2.2 Model-based Approaches

In the model-based approaches, training datasets are used to train a predefined model. Examples of model-based approaches include clustering models [11, 20, 22], aspect models [8, 9, 19] and the latent factor model [3]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available. The authors of [8] proposed an algorithm based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. The model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do [22].

2.3 Other Related Approaches

In order to take advantage of both memory-based and model-based approaches, hybrid collaborative filtering methods have been studied recently [14, 22]. [1, 4] unified collaborative filtering and content-based filtering, which achieved significant improvements over the standard approaches. At the same time, in order
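To make the approach described in this excerpt concrete, here is a minimal, hedged sketch in Python of its two ingredients: PCC similarity shrunk by significance weighting, and a thresholded prediction that blends user-based and item-based estimates only where confident neighbours exist. The parameter names gamma (the significance-weighting cap), eta and theta (the user and item similarity thresholds) and lam (the blending weight), together with their default values, are illustrative placeholders rather than the report's tuned settings; treat this as a sketch of the idea, not the reference implementation.

import numpy as np

def pcc(r_a, r_u):
    # Pearson Correlation Coefficient over co-rated items.
    # r_a, r_u: 1-D rating arrays for two users (or two items), np.nan = missing.
    co = ~np.isnan(r_a) & ~np.isnan(r_u)          # co-rated positions only
    if co.sum() < 2:
        return 0.0
    da = r_a[co] - np.nanmean(r_a)                # deviations from each own mean
    du = r_u[co] - np.nanmean(r_u)
    denom = np.sqrt((da * da).sum() * (du * du).sum())
    return float((da * du).sum() / denom) if denom > 0 else 0.0

def sim_weighted(r_a, r_u, gamma=30):
    # Significance weighting: shrink PCC when few items are co-rated,
    # sim' = min(n_co, gamma) / gamma * PCC  (gamma is the "added parameter").
    n_co = int((~np.isnan(r_a) & ~np.isnan(r_u)).sum())
    return min(n_co, gamma) / gamma * pcc(r_a, r_u)

def predict(R, a, i, eta=0.5, theta=0.5, lam=0.7):
    # Predict the missing rating R[a, i] in a user-item matrix R (np.nan = missing).
    # Neighbours are admitted only above the thresholds, so some entries are
    # deliberately left unpredicted, as the report describes.
    users = [(u, sim_weighted(R[a], R[u]))
             for u in range(R.shape[0]) if u != a and not np.isnan(R[u, i])]
    items = [(j, sim_weighted(R[:, i], R[:, j]))
             for j in range(R.shape[1]) if j != i and not np.isnan(R[a, j])]
    users = [(u, s) for u, s in users if s > eta]      # similar users who rated i
    items = [(j, s) for j, s in items if s > theta]    # similar items rated by a

    up = ip = None
    if users:   # user-based estimate: neighbours' deviations from their own means
        up = np.nanmean(R[a]) + sum(s * (R[u, i] - np.nanmean(R[u]))
                                    for u, s in users) / sum(s for _, s in users)
    if items:   # item-based estimate: deviations from the similar items' means
        ip = np.nanmean(R[:, i]) + sum(s * (R[a, j] - np.nanmean(R[:, j]))
                                       for j, s in items) / sum(s for _, s in items)

    if up is not None and ip is not None:
        return lam * up + (1 - lam) * ip    # blend both information sources
    if up is not None:
        return up
    if ip is not None:
        return ip
    return np.nan                           # no confident neighbours: leave missing

If the report follows the usual pipeline, these selectively predicted values densify the training matrix before the final recommendations for active users are computed; consult the report itself for that outer loop and for the exact formulas.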
Answered 1 day after: Jun 11, 2021

Answer To: Overview It is well-known that missing values are one of the biggest challenges in data science...

Shreyan answered on Jun 12 2021
139 Votes
The submitted answer is a Jupyter notebook; its cells follow.

# Assignment 3

import pandas as pd
import numpy as np

# Load the MovieLens 100K dataset into a pandas dataframe
names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=names)
df.head()

Output: FileNotFoundError: [Errno 2] No such file or directory: 'ml-100k/u.data'
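The error above only means that the ml-100k/ dataset folder was not in the working directory when the notebook was re-run. As a hedged convenience sketch (not part of the submitted answer), the archive can be fetched from GroupLens first; the URL below is the standard MovieLens 100K download:

import urllib.request, zipfile

# Download and unpack MovieLens 100K into the working directory
url = 'https://files.grouplens.org/datasets/movielens/ml-100k.zip'
urllib.request.urlretrieve(url, 'ml-100k.zip')
with zipfile.ZipFile('ml-100k.zip') as zf:
    zf.extractall('.')  # creates the ml-100k/ folder containing u.data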
# Select the 500 most active users and 500 most rated items from the dataset
n_most_active_users = 500
n_most_active_items = 500

user_ids = df.groupby('user_id').count().sort_values(by='rating', ascending=False).head(n_most_active_users).index
item_ids = df.groupby('item_id').count().sort_values(by='rating', ascending=False).head(n_most_active_items).index
df = df[(df['user_id'].isin(user_ids)) & (df['item_id'].isin(item_ids))]
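Restricting the data to the densest 500 x 500 user-item block keeps the pairwise similarity computations tractable, and appears intended to mirror the subset-style MovieLens setup used in the paper's experiments, though the paper's exact extraction rule may differ.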
# Map each original item ID to a new contiguous internal ID (0 .. n_items-1)
i_ids = df['item_id'].unique().tolist()
item_dict = dict(zip(i_ids, range(len(i_ids))))
df['item_id'] = df['item_id'].map(item_dict)
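This re-mapping matters later: once the item IDs are contiguous integers starting at 0 (for example, raw MovieLens item ID 181 might become internal ID 3), they can be used directly as column indices of a rating matrix.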
# Split Dataset
# The number of training users and active (test) users
n_training_users = 300
n_active_users = n_most_active_users - n_training_users

# The number of GIVEN ratings observed for each active user
GIVEN = 20

# Randomly select users from the most active users as the training set
random_uids = np.random.choice(df.user_id.unique(), n_training_users, replace=False)
train_df = df[df['user_id'].isin(random_uids)].copy()  # .copy() avoids pandas SettingWithCopyWarning
# Map new contiguous internal IDs for all users in the training set
u_ids = train_df['user_id'].unique().tolist()
user_dict = dict(zip(u_ids, range(len(u_ids))))
train_df['user_id'] = train_df['user_id'].map(user_dict)

# The remaining users are the active users used for testing
remain_df = df[~df['user_id'].isin(random_uids)].copy()
# Map new contiguous internal IDs for all active users
u_ids = remain_df['user_id'].unique().tolist()
user_dict = dict(zip(u_ids, range(len(u_ids))))
remain_df['user_id'] = remain_df['user_id'].map(user_dict)

# Randomly keep GIVEN observed ratings per active user; the rest are held out
active_df = remain_df.groupby('user_id').sample(n=GIVEN, random_state=1024)

test_df = remain_df[~remain_df.index.isin(active_df.index)]
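This split appears to follow the evaluation protocol the paper describes for MovieLens: the 300 training users supply the ratings from which similarities are computed, each of the 200 active users reveals only GIVEN = 20 ratings (a "Given20" configuration), and the withheld ratings collected in test_df are what the predictions are later scored against (the paper reports MAE).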
# Convert the format of datasets to matrices
df_zeros = pd.DataFrame({'user_id': np.tile(np.arange(0, n_training_users), n_most_active_items), 'item_id':...

Output: NameError: name 'n_training_users' is not defined (this cell was executed before the cell that defines n_training_users)
The remainder of the notebook is truncated here; the full answer is available to download as SOLUTION.PDF.
