everything is in file: CSC 180-01/02 Intelligent Systems (Fall 2021) Project 1: Yelp Business...

Question

everything is in file

CSC 180-01/02 Intelligent Systems (Fall 2021)
Project 1: Yelp Business Rating Prediction using Tensorflow
Due at 11:59 pm, Thursday, September 30, 2021
Peer Review: class time, Friday, October 1, 2021

1. Problem Formulation

In this project, we aim to predict a business's star rating based on all the review text for that business, using neural network implementations in TensorFlow. Treat this as a regression problem.

(1) Report the RMSE and plot the lift chart of the BEST neural network model you have obtained.
(2) Choose 3-5 arbitrary businesses from your test data (preferably from different categories). Show the names, the true star ratings, and the predicted ratings (from your best model) of those businesses.

2. Dataset

https://www.yelp.com/dataset

The dataset contains several JSON files. You can find the format of the data here:

https://www.yelp.com/dataset/documentation/main

3. Data Cleaning

In this project, we will only consider businesses with at least 20 reviews, so remove all businesses with fewer than 20 reviews.

4. Requirements

• You are required to split the data into training and test sets. Use the training data to train your models and evaluate model quality using the test data.
• Use TF-IDF to extract features from reviews. If you run into low-memory issues when using TfidfVectorizer, set the parameters max_df, min_df, and max_features appropriately.
• You must use EarlyStopping when training neural networks using Tensorflow.
• Tune the following hyperparameters when training neural networks using Tensorflow and record how they affect performance in your report. Tabulate your findings.
    • Activation: relu, sigmoid, tanh
    • Number of layers and neuron count for each layer
    • Optimizer: adam and sgd

5. Grading breakdown

You may feel this project is described with a certain degree of vagueness; this is on purpose. In other words, creativity is strongly encouraged. Your grade for this project will be based on the soundness of your design, the novelty of your work, and the effort you put into the project.

Use the evaluation form on Canvas as a checklist to make sure your work meets all the requirements.

6. Teaming:

Students must work in teams of no more than 3 people. Think clearly about who will do what on the project. Normally people in the same group will receive the same grade. However, the instructor reserves the right to assign different grades to team members depending on their contributions. So you should choose your partners carefully!

7. Deliverables:

(1) The HTML version of your notebook that includes all your source code. Go to “File” and then “Download as”. Click “HTML” to convert the notebook to HTML.

    5 pts will be deducted for an incorrect file format.

(2) Your report in PDF format, with your name, your id, course title, assignment id, and due date on the first page. As for length, I would expect a report with more than one page. Your report should include (but is not limited to) the following sections:

    (1) Problem Statement
    (2) Methodology
    (3) Experimental Results and Analysis
    (4) Task Division and Project Reflection

    In the section “Task Division and Project Reflection”, describe the following:
    • who is responsible for which part,
    • challenges your group encountered and how you solved them,
    • and what you have learned from the project as a team.

    10 pts will be deducted for missing the section on task division and project reflection.

    To submit your notebook and the report, go to Canvas “Assignments” and use “Project X (submit your code and report here)”. Use the evaluation form on Canvas as a checklist to make sure your work meets all the requirements.
(3) Link to your video presentation shared to the discussion board. Each team has three minutes to demo its work. Failure to submit the video presentation will result in zero points for the project. Allocate your time as follows:

    • Model/code design (1 minute)
    • Findings/results (1 minute)
    • Task division, challenges encountered, and what you learned from the project (1 minute)

    To submit the link to your video presentation, go to Canvas “Discussions” and use “Post Your Presentation for Project X Here”. Share your link by replying directly to my main discussion post.

All the deliverables must be submitted by the team leader on Canvas before

    11:59 pm, Thursday, September 30, 2021

NO late submissions will be accepted.

8. Peer Review:

During the class after the deadline, please review and comment on the presentations from other teams by replying to their posts. It is a great chance for you to learn from other people’s work. Please be nice, and provide constructive, specific feedback. You will become a better, more effective learner when you find yourself in a community of active learners!

9. Coding Hints

• You may use the following code to convert JSON data into a tabular format Pandas can read.

    import json
    import csv
    import pandas as pd

    # newline='' lets the csv module control line endings; utf-8 covers special characters
    outfile = open("review_stars.tsv", 'w', encoding="utf-8", newline='')
    sfile = csv.writer(outfile, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    sfile.writerow(['business_id', 'stars', 'text'])

    with open('yelp_academic_dataset_review.json', encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # write the text as str; calling .encode('utf-8') here would store a b'...' literal in Python 3
            sfile.writerow([row['business_id'], row['stars'], row['text']])

    outfile.close()

    df = pd.read_csv('review_stars.tsv', delimiter="\t", encoding="utf-8")

• You may use the following sample code to group ALL the reviews by each business and create a new dataframe, where each line is a business with all its reviews aggregated together. From there, you then use TfidfVectorizer to obtain the TF-IDF representation for each business.

    df_review_agg = df.groupby('business_id')['text'].sum()
    df_ready_for_sklearn = pd.DataFrame({'business_id': df_review_agg.index, 'all_reviews': df_review_agg.values})

• To align all the reviews of a business with its business star rating, you may want to join the review table with the business table on the business_id column. Pandas supports high performance SQL join operations. Use the Pandas function pd.merge() to merge (that is, join) two dataframes based on values in one particular column. See examples here:
  https://chrisalbon.com/code/python/data_wrangling/pandas_join_merge_dataframe/
• If you want to merge two numpy arrays, use the Numpy function np.concatenate().
• To convert a Pandas Dataframe to its corresponding Numpy array representation, use to_numpy().
• For one-hot coding, you may use Pandas pd.get_dummies().

10. Think beyond the Project

• Can you build a more accurate model by taking the number of reviews (review count) into account?
• What other information can be used to train a more accurate model? Business categories? Check-in count?
• Can you build a more accurate model by focusing only on a particular business category?

CSC 180-01/02 Intelligent Systems (Fall 2021)
Evaluation Form of Project 1 (Rubrics)

Group Member    Name ____________________    Id ____________________
Group Member    Name ____________________    Id ____________________
Group Member    Name ____________________    Id ____________________

Your report will be graded based on how you addressed the following items in the report:

Karthi · Accepted Answer

nn_network
In [ ]:
     
import numpy as np
import pandas as pd
from scipy import sparse
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from sklearn import svm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, SpatialDropout1D, GRU
from keras.layers import Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from sklearn.neighbors import KNeighborsClassifier
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from keras.models import Sequential
%matplotlib inline
     
In [ ]:
     
business = pd.read_csv("/home/kai/yelp_dataset/business.csv")
review_all = pd.read_csv("/home/kai/yelp_dataset/review.csv")
     
In [ ]:
     
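# keep only restaurant businesses (na=False treats missing categories as non-matches)
a = business[business['categories'].str.contains('Restaurant', na=False)]
# keep only the reviews written for those businesses
rev = review_all[review_all.business_id.isin(a['business_id'])]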
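
The assignment also asks that businesses with fewer than 20 reviews be dropped and that reviews be aggregated per business before joining them with the business star rating; the cells in this notebook work on individual reviews instead. A minimal sketch of that preparation on toy data, assuming the columns `business_id`, `text`, `name`, and `stars` (the helper name `build_business_table` and its `min_reviews` parameter are hypothetical, and the toy call lowers the threshold to 2 so something survives the filter):

```python
import pandas as pd

def build_business_table(business, reviews, min_reviews=20):
    """Aggregate review text per business and join it with the business rating.

    Hypothetical helper: assumes `reviews` has business_id/text columns and
    `business` has business_id/name/stars columns.
    """
    # count reviews per business and keep those meeting the threshold
    counts = reviews.groupby('business_id')['text'].count()
    keep = counts[counts >= min_reviews].index
    kept = reviews[reviews['business_id'].isin(keep)]
    # join the reviews of each business with a space so words do not fuse
    agg = kept.groupby('business_id')['text'].agg(' '.join).reset_index()
    agg = agg.rename(columns={'text': 'all_reviews'})
    # inner join: one row per business with its name, rating, and text
    return pd.merge(business[['business_id', 'name', 'stars']], agg, on='business_id')

# toy data (made up for illustration)
business_toy = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'name': ['Cafe A', 'Diner B'],
    'stars': [4.5, 3.0],
})
reviews_toy = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2'],
    'text': ['great coffee', 'friendly staff', 'slow service'],
})
table = build_business_table(business_toy, reviews_toy, min_reviews=2)
print(table)
```

Joining with `' '.join` rather than `.sum()` is a deliberate choice here: summing strings concatenates them with no separator, which can merge the last word of one review with the first word of the next.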
     
In [ ]:
     
rev_samp = rev.sample(n = 350000, random_state = 42)
train = rev_samp[0:280000]
test = rev_samp[280000:]
     
In [ ]:
     
train.shape, test.shape
     
    
    Out[ ]:
((280000, 9), (70000, 9))
In [ ]:
     
train = train[['text', 'stars']]
train['stars'].hist()
train.head()
     
    
    Out[ ]:
		text	stars
	2760442	Second time here.... first time had the pulled...	5
	3014452	Great place. Like their sauce and lunch specia...	5
	2876979	So goooooooood and so simple! I love their pel...	5
	469097	We stopped in for a late lunch on a Tuesday af...	3
	4971248	A great option to try hakka chinese since its ...	4
    
    
In [ ]:
     
train = pd.get_dummies(train, columns = ['stars'])
train.head()
     
    
    Out[ ]:
		text	stars_1	stars_2	stars_3	stars_4	stars_5
	2760442	Second time here.... first time had the pulled...	0	0	0	0	1
	3014452	Great place. Like their sauce and lunch specia...	0	0	0	0	1
	2876979	So goooooooood and so simple! I love their pel...	0	0	0	0	1
	469097	We stopped in for a late lunch on a Tuesday af...	0	0	1	0	0
	4971248	A great option to try hakka chinese since its ...	0	0	0	1	0
In [ ]:
     
test = test[['text', 'stars']]
test = pd.get_dummies(test, columns = ['stars'])
train.shape, test.shape
     
    
    Out[ ]:
((280000, 6), (70000, 6))
In [ ]:
     
train_samp = train.sample(frac = .1, random_state = 42)
test_samp = test.sample(frac = .1, random_state = 42)
train_samp.shape, test_samp.shape
     
    
    Out[ ]:
((28000, 6), (7000, 6))
In [ ]:
     
max_features = 2000
tfidf = TfidfVectorizer(max_features = max_features)
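
If the TF-IDF matrix becomes too large for memory, the assignment suggests constraining the vocabulary with `max_df`, `min_df`, and `max_features`. A small sketch on a made-up corpus shows what each parameter removes; the threshold values are illustrative assumptions, not tuned settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (made up for illustration)
docs = [
    "the food was great and the service was great",
    "the food was cold",
    "the service was slow and the food was bland",
    "great pizza great price",
]

vec = TfidfVectorizer(
    max_df=0.7,       # drop terms that appear in more than 70% of documents
    min_df=2,         # drop terms that appear in fewer than 2 documents
    max_features=3,   # keep at most the 3 most frequent remaining terms
)
X = vec.fit_transform(docs)   # sparse matrix: one row per document
print(X.shape)
print(sorted(vec.vocabulary_))
```

On this corpus, `max_df` removes very common words such as "the" and `min_df` removes words seen only once, so the remaining vocabulary is small. On the real review text the same three parameters bound the width of the matrix.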
     
In [ ]:
     
class NBFeatures(BaseEstimator):
    '''Class implementation of Jeremy Howard's NB Linear model'''
    def __init__(self, alpha):
        # Smoothing Parameter: always going to be one for my use
        self.alpha = alpha
        
    def preprocess_x(self, x, r):
        return x.multiply(r)
    
    # calculate probabilities
    def pr(self, x, y_i, y):
        p = x[y == y_i].sum(0)
        return (p + self.alpha)/((y==y_i).sum()+self.alpha)
    
    # calculate the log ratio and represent it as a sparse matrix,
    # i.e. fit the NB model
    def fit(self, x, y = None):
        self._r = sparse.csr_matrix(np.log(self.pr(x, 1, y) /self.pr(x, 0, y)))
        return self
    
    # apply the nb fit to original features x
    def transform(self, x):
        x_nb = self.preprocess_x(x, self._r)
        return x_nb
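
To make the transform concrete: `fit` computes, for each feature, the log of the ratio between its smoothed strength in the positive class and in the negative class, and `transform` scales every column by that ratio. The following toy re-computation uses NumPy and SciPy directly, assuming a made-up 4x3 count matrix and `alpha = 1` (all values are for illustration only):

```python
import numpy as np
from scipy import sparse

alpha = 1.0
# rows are documents, columns are features (made-up counts)
x = sparse.csr_matrix(np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
], dtype=float))
y = np.array([1, 1, 0, 0])

def pr(x, y_i, y):
    # smoothed per-feature sum divided by the smoothed class size
    p = x[y == y_i].sum(0)
    return (p + alpha) / ((y == y_i).sum() + alpha)

# log ratio of positive-class to negative-class feature strength
r = np.asarray(np.log(pr(x, 1, y) / pr(x, 0, y))).ravel()
# scale each column of x by its ratio, as transform() does
x_nb = x.multiply(sparse.csr_matrix(r))
print(np.round(r, 3))
print(x_nb.toarray())
```

A feature that is equally strong in both classes gets a ratio of 1 and a log ratio of 0, so its column is zeroed out; features that favor one class are scaled up with a positive or negative sign.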
     
In [ ]:
     
lr = LogisticRegression()
nb = NBFeatures(1)
p = Pipeline([
    ('tfidf', tfidf),
    ('nb', nb),
    ('lr', lr)
])
     
In [ ]:
     
class_names = ['stars_1', 'stars_2', 'stars_3', 'stars_4', 'stars_5']
scores = []
preds = np.zeros((len(test_samp), len(class_names)))
for i, class_name in enumerate(class_names):
    train_target = train_samp[class_name]    
    cv_score = np.mean(cross_val_score(estimator = p, X = train_samp['text'].values, 
                                      y = train_target, cv = 3, scoring = 'accuracy'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))
    p.fit(train_samp['text'].values, train_target)
    preds[:,i] = p.predict_proba(test_samp['text'].values)[:,1]
     
    
    
CV score for class stars_1 is 0.9282499819604656
CV score for class stars_2 is 0.90339283521352
CV score for class stars_3 is 0.8591786654537303
CV score for class stars_4 is 0.7321071676830603
CV score for class stars_5 is 0.
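
The assignment frames the task as regression and asks for an RMSE, whereas `preds` holds five independent one-vs-rest probabilities. One way to bridge the two, offered here as an assumption rather than as this notebook's method, is to normalize each row of probabilities, take the expected star value, and compare it with the true rating. A sketch with made-up numbers (`preds_toy` and `true_stars` are hypothetical stand-ins for `preds` and the test labels):

```python
import numpy as np

# made-up one-vs-rest probabilities for classes 1..5 stars (rows need not sum to 1)
preds_toy = np.array([
    [0.05, 0.05, 0.10, 0.30, 0.70],
    [0.60, 0.30, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.50, 0.20, 0.10],
])
true_stars = np.array([5, 1, 3])

star_values = np.arange(1, 6)
# normalize each row so the five scores form a distribution
probs = preds_toy / preds_toy.sum(axis=1, keepdims=True)
# expected rating under that distribution
expected = probs @ star_values
# root mean squared error against the true ratings
rmse = np.sqrt(np.mean((expected - true_stars) ** 2))
print(np.round(expected, 3), round(float(rmse), 3))
```

Because the expected value is an average over the five classes, it is pulled toward the middle of the scale; a network trained directly on the star rating as a numeric target would avoid that bias.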
