The file http://cse151a.com/data/yelp/train.csv contains 10,000 Yelp reviews along with the score the user left (from 1 to 5, with 5 being the best). In this plus problem, you’ll train an SVM to do sentiment analysis on these reviews and predict the sentiment of an unlabeled piece of text.






Split the data 75%/25% into training and validation sets, encode the training data using a bag-of-words feature representation, and train a (linear, soft-margin) support vector machine. When training, consider any review with a score of 4 or higher to be a positive review, and anything with a lower score to be a negative review. Find the value of C that minimizes the error of your classifier on the validation set, and make a plot of the validation error as a function of C.
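To make the encoding and labeling steps concrete, here is a minimal hand-rolled sketch of a bag-of-words representation and the score-to-label rule. The helper names (`tokenize`, `build_vocabulary`, `encode`, `to_label`), the tokenization rule, and the two sample reviews are assumptions made for illustration; they are not part of the assignment or of train.csv.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Map each distinct word in the training texts to a column index."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def encode(text, vocabulary):
    """Return the bag-of-words count vector for one text.

    Words missing from the vocabulary (for example, words that appear only
    in the validation set) are ignored.
    """
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


def to_label(score):
    """A score of 4 or higher is positive (+1); anything lower is negative (-1)."""
    return 1 if score >= 4 else -1


# Made-up reviews for illustration only; these are not from train.csv.
train_texts = ["Great food, great service", "Terrible food"]
train_scores = [5, 1]

vocabulary = build_vocabulary(train_texts)
X_train = [encode(text, vocabulary) for text in train_texts]
y_train = [to_label(score) for score in train_scores]

print(vocabulary)  # {'food': 0, 'great': 1, 'service': 2, 'terrible': 3}
print(X_train)     # [[1, 2, 1, 0], [1, 0, 0, 1]]
print(y_train)     # [1, -1]
```

The vocabulary is built from the training texts only, so a validation review is encoded with the same columns and any unseen words are dropped.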


For this part, turn in four things:


1. the value of C that was best,


2. the training and validation error that corresponded to this choice of C,


3. your plot, and


4. your code.


You can use whatever machine learning libraries you like in whatever language you’d like. Note that most languages have libraries which will do the bag-of-words encoding for you. For instance, sklearn has this feature (but I’ll let you Google for it!).
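One possible shape for this part, assuming Python with scikit-learn, is sketched below. It is an illustration under stated assumptions, not a reference solution: the function names `sweep_C` and `plot_validation_error`, the grid of C values, the random seed, and the stand-in reviews are all hypothetical. Loading train.csv is left out because its column layout is not described here. `LinearSVC` with hinge loss is used as the linear soft-margin SVM; note that it also regularizes the intercept, which differs slightly from the textbook formulation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def sweep_C(texts, scores, Cs, seed=0):
    """Return (train_errors, val_errors), one entry per value in Cs."""
    # A score of 4 or higher is positive (+1); anything lower is negative (-1).
    labels = np.where(np.asarray(scores) >= 4, 1, -1)

    # 75%/25% split into training and validation sets.
    txt_train, txt_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.25, random_state=seed
    )

    # Learn the vocabulary from the training reviews only, then reuse it
    # to encode the validation reviews.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(txt_train)
    X_val = vectorizer.transform(txt_val)

    train_errors, val_errors = [], []
    for C in Cs:
        clf = LinearSVC(C=C, loss="hinge", dual=True, max_iter=10000)
        clf.fit(X_train, y_train)
        # score() returns accuracy, so the error is one minus that.
        train_errors.append(1.0 - clf.score(X_train, y_train))
        val_errors.append(1.0 - clf.score(X_val, y_val))
    return np.array(train_errors), np.array(val_errors)


def plot_validation_error(Cs, val_errors, path="validation_error.png"):
    """Save a plot of validation error against C (log-scaled x axis)."""
    import matplotlib.pyplot as plt  # imported here so the sweep runs without it

    fig, ax = plt.subplots()
    ax.semilogx(Cs, val_errors, marker="o")
    ax.set_xlabel("C")
    ax.set_ylabel("validation error")
    fig.savefig(path)
    plt.close(fig)


# Made-up stand-in data so the sketch runs without train.csv.
texts = ["great food and friendly staff", "awful service and cold food"] * 20
scores = [5, 1] * 20
Cs = [0.001, 0.01, 0.1, 1.0, 10.0]

train_err, val_err = sweep_C(texts, scores, Cs)
best = int(np.argmin(val_err))
print("best C:", Cs[best])
print("training error:", train_err[best])
print("validation error:", val_err[best])
```

On the real data, `texts` and `scores` would be replaced by the review text and score columns of train.csv, and a call to `plot_validation_error(Cs, val_err)` would produce the requested plot. A logarithmically spaced grid of C values is a common choice, since useful values of C can span several orders of magnitude.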


Is the data in train.csv linearly separable? How do you know?
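One way to probe this question empirically is sketched below, assuming scikit-learn. A data set is linearly separable exactly when some linear classifier makes no mistakes on it, and a soft-margin SVM with a very large C behaves almost like a hard-margin SVM. If the fitted classifier reaches zero training error, it is itself a separating hyperplane, which proves separability. If it does not, that is only suggestive, because the solver may not have converged. The function name `reaches_zero_training_error`, the value of C, and the toy points are assumptions for illustration; the sketch does not state the answer for train.csv.

```python
import numpy as np
from sklearn.svm import LinearSVC


def reaches_zero_training_error(X, y, C=1e6, max_iter=10000):
    """Fit a linear SVM with a very large C and report whether it
    classifies every training point correctly."""
    clf = LinearSVC(C=C, loss="hinge", dual=True, max_iter=max_iter)
    clf.fit(X, y)
    return bool(np.all(clf.predict(X) == np.asarray(y)))


# Toy data for illustration only; these points are not from train.csv.
separable_X = np.array([[0, 0], [0, 1], [3, 3], [3, 4]])
separable_y = np.array([-1, -1, 1, 1])

# XOR pattern: no line can separate these labels.
xor_X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
xor_y = np.array([-1, -1, 1, 1])

print(reaches_zero_training_error(separable_X, separable_y))
print(reaches_zero_training_error(xor_X, xor_y))
```

For the reviews, `X` would be the bag-of-words matrix and `y` the positive/negative labels. Two reviews with identical bag-of-words vectors but opposite labels would rule out separability outright, so that is worth checking as well.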

May 14, 2021