CSC 180-01/02 Intelligent Systems (Fall 2021) Project 1: Yelp Business Rating Prediction using Tensorflow Due at 11:59 pm, Thursday, September 30, 2021 Peer Review: class time, Friday, October 1, 2021...

everything is in file


CSC 180-01/02 Intelligent Systems (Fall 2021)
Project 1: Yelp Business Rating Prediction using Tensorflow
Due at 11:59 pm, Thursday, September 30, 2021
Peer Review: class time, Friday, October 1, 2021

1. Problem Formulation
In this project, we aim to predict a business's star rating based on all the review text for that business, using neural network implementations in TensorFlow. Treat this as a regression problem.
(1) Report the RMSE and plot the lift chart of the BEST neural network model you have obtained.
(2) Choose 3-5 arbitrary businesses from your test data (preferably from different categories). Show the names, the true star ratings, and the predicted ratings (from your best model) of those businesses.

2. Dataset
https://www.yelp.com/dataset
The dataset contains several JSON files. You can find the format of the data here: https://www.yelp.com/dataset/documentation/main
Example file formats are as follows.
[Figure: example file formats]

3. Data Cleaning
In this project, we will only consider businesses with at least 20 reviews, so remove all businesses with fewer than 20 reviews.

4. Requirements
• You are required to split the data into training and test sets. Use the training data to train your models and evaluate model quality on the test data.
• Use TF-IDF to extract features from reviews. If you run into low-memory issues when using TfidfVectorizer, set the parameters max_df, min_df, and max_features appropriately.
• You must use EarlyStopping when training neural networks using Tensorflow.
• Tune the following hyperparameters when training neural networks using Tensorflow and record how they affect performance in your report. Tabulate your findings.
  • Activation: relu, sigmoid, tanh
  • Number of layers and neuron count for each layer
  • Optimizer: adam and sgd

5. Grading breakdown
You may feel this project is described with a certain degree of vagueness, which is on purpose. In other words, creativity is strongly encouraged. Your grade for this project will be based on the soundness of your design, the novelty of your work, and the effort you put into the project. Use the evaluation form on Canvas as a checklist to make sure your work meets all the requirements.

6. Teaming
Students must work in teams of no more than 3 people. Think clearly about who will do what on the project. Normally people in the same group will receive the same grade. However, the instructor reserves the right to assign different grades to team members depending on their contributions. So you should choose your partners carefully!

7. Deliverables
(1) The HTML version of your notebook that includes all your source code. Go to “File” and then “Download as”. Click “HTML” to convert the notebook to HTML. 5 pts will be deducted for an incorrect file format.
(2) Your report in PDF format, with your name, your id, course title, assignment id, and due date on the first page. As for length, I would expect a report of more than one page. Your report should include (but is not limited to) the following sections:
  (1) Problem Statement
  (2) Methodology
  (3) Experimental Results and Analysis
  (4) Task Division and Project Reflection
In the section “Task Division and Project Reflection”, describe the following:
  • who is responsible for which part,
  • challenges your group encountered and how you solved them,
  • and what you have learned from the project as a team.
10 pts will be deducted for missing the section on task division and project reflection.
To submit your notebook and the report, go to Canvas “Assignments” and use “Project X (submit your code and report here)”. Use the evaluation form on Canvas as a checklist to make sure your work meets all the requirements.
(3) Link to your video presentation, shared to the discussion board. Each team has three minutes to demo its work. Failure to submit the video presentation will result in zero points for the project. The following is how you should allocate your time:
  • Model/code design (1 minute)
  • Findings/results (1 minute)
  • Task division, challenges encountered, and what you learned from the project (1 minute)
To submit the link to your video presentation, go to Canvas “Discussions” and use “Post Your Presentation for Project X Here”. Share your link by replying directly to my main discussion post.
All the deliverables must be submitted by the team leader on Canvas before 11:59 pm, Thursday, September 30, 2021. NO late submissions will be accepted.

8. Peer Review
During the class after the deadline, please review and comment on the presentations from other teams by replying to their posts. It is a great chance for you to learn from other people’s work. Please be nice, and provide constructive, specific feedback. You will become a better, more effective learner when you find yourself in a community of active learners!

9. Coding Hints
• You may use the following code to convert JSON data into a tabular format Pandas can read.

    import json
    import csv
    import pandas as pd

    # newline='' lets the csv module control line endings
    with open("review_stars.tsv", 'w', encoding="utf-8", newline='') as outfile:
        sfile = csv.writer(outfile, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
        sfile.writerow(['business_id', 'stars', 'text'])
        with open('yelp_academic_dataset_review.json', encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                # the output file is opened as UTF-8, so the text is written as str
                # (calling .encode() here would write a b'...' literal in Python 3)
                sfile.writerow([row['business_id'], row['stars'], row['text']])

    df = pd.read_csv('review_stars.tsv', delimiter="\t", encoding="utf-8")

• You may use the following sample code to group ALL the reviews by each business and create a new dataframe, where each line is a business with all its reviews aggregated together. From there, you then use TfidfVectorizer to obtain the TF-IDF representation for each business.

    df_review_agg = df.groupby('business_id')['text'].sum()
    df_ready_for_sklearn = pd.DataFrame({'business_id': df_review_agg.index, 'all_reviews': df_review_agg.values})

• To align all the reviews of a business with its business star rating, you may want to join the review table with the business table on the business_id column. Pandas supports high performance SQL join operations. Use the Pandas function pd.merge() to merge (that is, join) two dataframes based on values in one particular column. See examples here: https://chrisalbon.com/code/python/data_wrangling/pandas_join_merge_dataframe/
• If you want to merge two numpy arrays, use the Numpy function np.concatenate().
• To convert a Pandas DataFrame to its corresponding Numpy array representation, use to_numpy().
• For one-hot coding, you may use Pandas pd.get_dummies().

10. Think beyond the Project
• Can you build a more accurate model by taking the number of reviews (review count) into account?
• What other information can be used to train a more accurate model? Business categories? Check-in count?
• Can you build a more accurate model by focusing only on a particular business category?
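
The cleaning, aggregation, and join steps described in the Data Cleaning section and the coding hints can be sketched with pandas roughly as follows. This is a minimal illustration, not the required solution: the two toy tables, the MIN_REVIEWS constant, and the use of a review_count column for filtering are assumptions made for the example, and the threshold is lowered from 20 so the toy data survives the filter.

```python
import pandas as pd

# Toy stand-ins for the review and business tables (hypothetical data).
reviews = pd.DataFrame({
    "business_id": ["a", "a", "a", "b", "b", "c"],
    "text": ["great food", "slow service", "would return",
             "too loud", "fine coffee", "closed early"],
})
businesses = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "name": ["Alpha Cafe", "Beta Bar", "Gamma Grill"],
    "stars": [4.0, 3.5, 2.0],
    "review_count": [3, 2, 1],
})

MIN_REVIEWS = 2  # the assignment uses 20; lowered here for the toy data

# Keep only businesses with at least MIN_REVIEWS reviews.
kept = businesses[businesses["review_count"] >= MIN_REVIEWS]

# Aggregate every review of a business into one document.
# " ".join keeps a space between reviews, which .sum() does not.
agg = (
    reviews.groupby("business_id")["text"]
    .agg(" ".join)
    .reset_index()
    .rename(columns={"text": "all_reviews"})
)

# An inner join on business_id aligns each document with its star rating.
dataset = pd.merge(kept, agg, on="business_id", how="inner")
print(dataset[["name", "stars", "all_reviews"]])
```

Joining with how="inner" drops any business that has no review text after filtering, so every remaining row has both a target (stars) and a document (all_reviews).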
CSC 180-01/02 Intelligent Systems (Fall 2021)
Evaluation Form of Project 1 (Rubrics)
Group Member Name ____________________ Id ____________________
Group Member Name ____________________ Id ____________________
Group Member Name ____________________ Id ____________________
Your report will be graded based on how you addressed the following items in the report:
Answered 4 days After Sep 20, 2021

Answer To: CSC 180-01/02 Intelligent Systems (Fall 2021) Project 1: Yelp Business Rating Prediction using...

Karthi answered on Sep 23 2021
145 Votes
nn_network
In [ ]:

import numpy as np
import pandas as pd
from scipy import sparse
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from sklearn import svm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, SpatialDropout1D, GRU
from keras.layers import Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from sklearn.neighbors import KNeighborsClassifier
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from keras.models import Sequential
%matplotlib inline

In [ ]:

business = pd.read_csv("/home/kai/yelp_dataset/business.csv")
review_all = pd.read_csv("/home/kai/yelp_dataset/review.csv")

In [ ]:

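# Keep businesses whose categories mention 'Restaurant'; na=False treats missing categories as non-matches
a = business[business['categories'].str.contains('Restaurant', na=False)]
# Keep only the reviews that belong to those businesses
rev = review_all[review_all.business_id.isin(a['business_id'])]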

In [ ]:

rev_samp = rev.sample(n = 350000, random_state = 42)
train = rev_samp[0:280000]
test = rev_samp[280000:]
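
Slicing a shuffled sample as above gives an 80/20 split. An alternative that makes the ratio explicit is scikit-learn's train_test_split. The sketch below is only illustrative: rev_demo is a hypothetical stand-in for rev_samp, and the names train_demo and test_demo are not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for rev_samp: ten reviews with star labels.
rev_demo = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "stars": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
})

# 80/20 split, matching the 280000/70000 proportions used above.
train_demo, test_demo = train_test_split(
    rev_demo, test_size=0.2, random_state=42
)
print(train_demo.shape, test_demo.shape)
```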

In [ ]:

train.shape, test.shape


Out[ ]:
((280000, 9), (70000, 9))
In [ ]:

train = train[['text', 'stars']]
train['stars'].hist();train.head()


Out[ ]:
                                                      text  stars
2760442  Second time here.... first time had the pulled...      5
3014452  Great place. Like their sauce and lunch specia...      5
2876979  So goooooooood and so simple! I love their pel...      5
 469097  We stopped in for a late lunch on a Tuesday af...      3
4971248  A great option to try hakka chinese since its ...      4


In [ ]:

train = pd.get_dummies(train, columns = ['stars'])
train.head()


Out[ ]:
                                                      text  stars_1  stars_2  stars_3  stars_4  stars_5
2760442  Second time here.... first time had the pulled...        0        0        0        0        1
3014452  Great place. Like their sauce and lunch specia...        0        0        0        0        1
2876979  So goooooooood and so simple! I love their pel...        0        0        0        0        1
 469097  We stopped in for a late lunch on a Tuesday af...        0        0        1        0        0
4971248  A great option to try hakka chinese since its ...        0        0        0        1        0
In [ ]:

test = test[['text', 'stars']]
test = pd.get_dummies(test, columns = ['stars'])
train.shape, test.shape


Out[ ]:
((280000, 6), (70000, 6))
In [ ]:

train_samp = train.sample(frac = .1, random_state = 42)
test_samp = test.sample(frac = .1, random_state = 42)
train_samp.shape, test_samp.shape


Out[ ]:
((28000, 6), (7000, 6))
In [ ]:

max_features = 2000
tfidf = TfidfVectorizer(max_features = max_features)
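
The assignment also mentions max_df and min_df as ways to keep the TF-IDF vocabulary small. The sketch below shows how the three parameters interact on a tiny corpus. The documents and the threshold values are made up for illustration and are not tuned for the Yelp data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus standing in for the review text.
docs = [
    "the pizza was great and the service was great",
    "the pizza was cold",
    "the staff was friendly",
    "great coffee and friendly staff",
]

vec = TfidfVectorizer(
    max_df=0.7,      # drop terms that appear in more than 70% of documents
    min_df=2,        # drop terms that appear in fewer than 2 documents
    max_features=5,  # keep at most 5 of the remaining terms
)
X = vec.fit_transform(docs)

# "the" and "was" are removed by max_df; one-off words by min_df.
print(sorted(vec.vocabulary_))
print(X.shape)
```

On the real data, tightening min_df and max_df shrinks the vocabulary before max_features is applied, which is usually where most of the memory savings come from.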

In [ ]:

class NBFeatures(BaseEstimator):
    '''Class implementation of Jeremy Howard's NB Linear model'''
    def __init__(self, alpha):
        # Smoothing Parameter: always going to be one for my use
        self.alpha = alpha

    def preprocess_x(self, x, r):
        return x.multiply(r)

    # calculate probabilities
    def pr(self, x, y_i, y):
        p = x[y == y_i].sum(0)
        return (p + self.alpha)/((y==y_i).sum()+self.alpha)

    # calculate the log ratio and represent as sparse matrix
    # ie fit the nb model
    def fit(self, x, y = None):
        self._r = sparse.csr_matrix(np.log(self.pr(x, 1, y) /self.pr(x, 0, y)))
        return self

    # apply the nb fit to original features x
    def transform(self, x):
        x_nb = self.preprocess_x(x, self._r)
        return x_nb
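
To see what the fit and transform steps of NBFeatures compute, the log-count ratio can be worked through by hand on a small matrix. The sketch below repeats the same arithmetic outside the class. The toy matrix, the labels, and the standalone pr helper are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

# Toy feature matrix: 4 documents x 3 features (hypothetical values).
x = sparse.csr_matrix(np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
], dtype=float))
y = np.array([1, 1, 0, 0])  # binary target, e.g. one of the stars_k columns
alpha = 1.0                 # smoothing term, as in NBFeatures(1)

def pr(x, y_i, y, alpha):
    # Smoothed per-feature total for class y_i, divided by the smoothed class size.
    p = np.asarray(x[y == y_i].sum(axis=0)).ravel()
    return (p + alpha) / ((y == y_i).sum() + alpha)

# Log ratio of the class-1 to class-0 smoothed feature rates.
r = np.log(pr(x, 1, y, alpha) / pr(x, 0, y, alpha))

# Scale every column of x by its log ratio (the transform step).
x_nb = x.multiply(r.reshape(1, -1)).tocsr()
print(r)
print(x_nb.toarray())
```

Features that are more common in the positive class get a positive weight, features more common in the negative class get a negative weight, and a feature that is equally common in both gets a weight of zero and is effectively removed.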

In [ ]:

lr = LogisticRegression()
nb = NBFeatures(1)
p = Pipeline([
    ('tfidf', tfidf),
    ('nb', nb),
    ('lr', lr)
])

In [ ]:

class_names = ['stars_1', 'stars_2', 'stars_3', 'stars_4', 'stars_5']
scores = []
preds = np.zeros((len(test_samp), len(class_names)))
# Fit one binary (one-vs-rest) model per star class
for i, class_name in enumerate(class_names):
    train_target = train_samp[class_name]
    # 3-fold cross-validated accuracy on the training sample
    cv_score = np.mean(cross_val_score(estimator = p, X = train_samp['text'].values,
                                       y = train_target, cv = 3, scoring = 'accuracy'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))
    # Refit on the full training sample and store P(class) for each test review
    p.fit(train_samp['text'].values, train_target)
    preds[:,i] = p.predict_proba(test_samp['text'].values)[:,1]



CV score for class stars_1 is 0.9282499819604656
CV score for class stars_2 is 0.90339283521352
CV score for class stars_3 is 0.8591786654537303
CV score for class stars_4 is 0.7321071676830603
CV score for class stars_5 is...
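
Because the assignment asks for an RMSE, the per-class probabilities stored in preds would need to be turned into a single numeric rating. One possible way, sketched below, is to normalise each row and take the expected star value. This is an assumption about how the scores could be combined, not something the notebook shows, and preds_demo and true_stars are made-up values.

```python
import numpy as np

# Hypothetical one-vs-rest probabilities for 3 reviews x 5 star classes,
# shaped like the `preds` array built in the loop above.
preds_demo = np.array([
    [0.05, 0.05, 0.10, 0.30, 0.50],
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.40, 0.20, 0.10],
])
true_stars = np.array([5, 1, 3])

star_values = np.arange(1, 6)

# One-vs-rest scores need not sum to 1, so normalise each row first.
probs = preds_demo / preds_demo.sum(axis=1, keepdims=True)

# Expected star rating under the normalised distribution.
expected_stars = probs @ star_values

# Root mean squared error against the true ratings.
rmse = np.sqrt(np.mean((expected_stars - true_stars) ** 2))
print(expected_stars, rmse)
```

Taking np.argmax(probs, axis=1) + 1 instead would give a hard class prediction, which is the natural choice for accuracy but tends to give a coarser RMSE than the expected value.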