{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PS1: Perceptron Classifier for Native Language Identification" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {},...

Need solutions to this ASAP


{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PS1: Perceptron Classifier for Native Language Identification" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import sys, os, glob\n", "\n", "from collections import Counter\n", "from math import log\n", "from numpy import mean\n", "import numpy as np\n", "\n", "from nltk.stem.wordnet import WordNetLemmatizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Native Language Identification (NLI) task is to predict, given an English document written by an English as a second language (ESL) learner, what is the author's native language. This can be done with better-than-chance accuracy because the native language leaves traces in the way one uses a second language. For this assignment you will complete a multiclass perceptron implementation for the NLI classification task.\n", "\n", "ETS has released a dataset of essays written by native speakers of 16 languages. The documents have already been tokenized and split into train, dev, and test sets (separate directories).\n", "\n", "Your code should go in the directory that has train/, dev/, and test/ as subdirectories. Please do not rename any of the files or directories I have given you.\n", "\n", "Complete the missing portions of the starter code for training, predicting with, and evaluating the model.\n", "\n", "A Python assert statement checks whether a condition that should be true is actually true, and if it isn’t, raises an AssertionError. Keep the assert statements that are in the starter code: they are there to help catch some common bugs.\n", "\n", "Coding convention: Whenever pairing a data point’s gold label and prediction, I suggest using the abbreviations gold and pred, respectively, and always putting the gold label first.\n", "\n", "## Submission Instructions\n", "\n", "After completing the exercises below, generate a pdf of the code **with** outputs. 
After that create a zip file containing both the completed exercise and the generated PDF. You are **required** to check the PDF to make sure all the code **and** outputs are clearly visible and easy to read. If your code goes off the page, you should reduce the line size. I generally recommend not going over 80 characters.\n", "\n", "Finally, name the zip file using a combination of your the assigment and your name, e.g., ps1_rios.zip" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Evaluation Code\n", "from collections import Counter\n", "\n", "class Eval:\n", " def __init__(self, gold, pred):\n", " assert len(gold)==len(pred)\n", " self.gold = gold\n", " self.pred = pred\n", "\n", " def accuracy(self):\n", " numer = sum(1 for p,g in zip(self.pred,self.gold) if p==g)\n", " return numer / len(self.gold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 1\n", "\n", "a) How many documents are there for each language in the training set? the dev set? (Make a table.) 
[3 pts]\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# This function loads and tokenizes the text for you.\n", "def load_docs(direc, lemmatize, labelMapFile='labels.csv'):\n", " \"\"\"Return a list of word-token-lists, one per document.\n", " Words are optionally lemmatized with WordNet.\"\"\"\n", "\n", "\n", " labelMap = {} # docID => gold label, loaded from mapping file\n", " with open(os.path.join(direc, labelMapFile)) as inF:\n", " for ln in inF:\n", " docid, label = ln.strip().split(',')\n", " assert docid not in labelMap\n", " labelMap[docid] = label\n", "\n", " # create parallel lists of documents and labels\n", " docs = []\n", " labels = []\n", " for file_path in glob.glob(os.path.join(direc, '*.txt')):\n", " filename = os.path.basename(file_path)\n", " with open(file_path) as iFile:\n", " docs.append(iFile.read().strip().split())\n", " labels.append(labelMap[filename.split('.')[0]])\n", " \n", " return docs, labels" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "documents, labels = load_docs('./data/train/', False, 'labels.csv')\n", "documents, labels = load_docs('./data/dev/', False, 'labels.csv')\n", "# WRITE CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "b) What would be the majority class baseline accuracy on the dev set, i.e., what is the accuracy if you always predict the most frequent class? [3 pts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 2\n", "\n", "Implement the perceptron algorithm without averaging or early stopping.\n", "\n", "The maximum number of iterations to run is specified as a method parameter. 
(However, do stop before the specified number of iterations if the training data are completely separated, i.e., an iteration proceeds without any errors/updates.) As a baseline featureset, implement bias features and (binary) unigram features. \n", "\n", "Run training for up to 30 iterations, tracking train and dev set accuracy after each iteration. I suggest printing to stderr. With my implementation, the first 5 lines printed are:\n", "\n", "```\n", "5366 training docs with 154.68747670518076 percepts on avg\n", "598 dev docs with 154.81939799331104 percepts on avg\n", "604 test docs with 153.6341059602649 percepts on avg\n", "iteration: 0 updates=2980, trainAcc=0.4446515095042862, devAcc=0.560200668896321, params=110837\n", "iteration: 1 updates=1176, trainAcc=0.7808423406634365, devAcc=0.540133779264214, params=127017\n", "```\n", "\n", "(You don't necessarily need to reproduce this output exactly; it is just for illustration.)\n", "\n", "[code: 18 pts]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def extract_feats(doc):\n", " \"\"\"\n", " Extract input features (percepts) for a given document.\n", " Each percept is a pairing of a name and a boolean, integer, or float value.\n", " A document's percepts are the same regardless of the label considered.\n", " :doc list: list of strings, e.g., ['the', 'fat', 'cat']\n", " :return Counter: Counter of word-count (key-value) pairs.\n", " \"\"\"\n", " ff = Counter()\n", " # Write code to transform doc into Count BoW features\n",
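The starter cell above ends at the point where the features are to be filled in. A minimal sketch of one possible completion, assuming binary unigram percepts plus a single always-on bias percept as Problem 2 describes (the feature name `'**BIAS**'` and the helper name `extract_feats_sketch` are my own illustrative choices, not part of the starter interface):

```python
from collections import Counter

def extract_feats_sketch(doc):
    """Illustrative featurizer: binary unigrams plus a bias percept."""
    ff = Counter()
    for word in set(doc):  # set() makes each unigram feature binary
        ff[word] = 1
    ff['**BIAS**'] = 1     # bias percept fires for every document
    return ff

feats = extract_feats_sketch(['the', 'fat', 'cat', 'the'])
```

Note that although the docstring mentions word counts, Problem 2 asks for *binary* unigram features, so this sketch stores 1 regardless of how many times a word occurs.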
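Problem 2's training loop (one weight vector per class, promote the gold class and demote the wrongly predicted class on each error, stop early when an epoch makes no updates) could be sketched as follows. This is a hedged illustration on made-up toy data; the names `predict` and `train_sketch` are my own and are not the assignment's required interface:

```python
from collections import Counter, defaultdict

def predict(weights, feats, classes):
    """Return the class whose weight vector scores feats highest."""
    scores = {c: sum(weights[c][f] * v for f, v in feats.items())
              for c in classes}
    return max(classes, key=lambda c: scores[c])

def train_sketch(docs_feats, golds, classes, max_iters=30):
    """Plain multiclass perceptron: no averaging, stop if an epoch is error-free."""
    weights = {c: defaultdict(float) for c in classes}
    for it in range(max_iters):
        updates = 0
        for feats, gold in zip(docs_feats, golds):
            pred = predict(weights, feats, classes)
            if pred != gold:
                updates += 1
                for f, v in feats.items():
                    weights[gold][f] += v  # promote the gold class
                    weights[pred][f] -= v  # demote the predicted class
        if updates == 0:  # training data separated: stop early
            break
    return weights

# Tiny linearly separable toy set: one "French" doc, one "German" doc
toy = [Counter({'bonjour': 1, '**BIAS**': 1}),
       Counter({'hallo': 1, '**BIAS**': 1})]
gold = ['FRA', 'DEU']
w = train_sketch(toy, gold, ['FRA', 'DEU'])
```

On separable data like this toy set, the loop terminates as soon as a full pass over the training docs produces zero updates, well before `max_iters`.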
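Returning to Problem 1, a `collections.Counter` over the gold labels gives both the per-language document table (1a) and the majority-class baseline accuracy (1b). The label list below is a toy stand-in, purely for illustration; the real labels come from `load_docs`:

```python
from collections import Counter

# Toy labels standing in for the list returned by load_docs
dev_labels = ['DEU', 'FRA', 'DEU', 'ITA', 'DEU', 'FRA']

counts = Counter(dev_labels)  # documents per language (1a)
majority_label, majority_count = counts.most_common(1)[0]
baseline_acc = majority_count / len(dev_labels)  # always predict majority (1b)
```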