{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PS1: Perceptron Classifier for Native Language Identification" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {},...

Need solutions to this ASAP


{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PS1: Perceptron Classifier for Native Language Identification" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import sys, os, glob\n", "\n", "from collections import Counter\n", "from math import log\n", "from numpy import mean\n", "import numpy as np\n", "\n", "from nltk.stem.wordnet import WordNetLemmatizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Native Language Identification (NLI) task is to predict, given an English document written by an English as a second language (ESL) learner, what is the author's native language. This can be done with better-than-chance accuracy because the native language leaves traces in the way one uses a second language. For this assignment you will complete a multiclass perceptron implementation for the NLI classification task.\n", "\n", "ETS has released a dataset of essays written by native speakers of 16 languages. The documents have already been tokenized and split into train, dev, and test sets (separate directories).\n", "\n", "Your code should go in the directory that has train/, dev/, and test/ as subdirectories. Please do not rename any of the files or directories I have given you.\n", "\n", "Complete the missing portions of the starter code for training, predicting with, and evaluating the model.\n", "\n", "A Python assert statement checks whether a condition that should be true is actually true, and if it isn’t, raises an AssertionError. Keep the assert statements that are in the starter code: they are there to help catch some common bugs.\n", "\n", "Coding convention: Whenever pairing a data point’s gold label and prediction, I suggest using the abbreviations gold and pred, respectively, and always putting the gold label first.\n", "\n", "## Submission Instructions\n", "\n", "After completing the exercises below, generate a pdf of the code **with** outputs. 
After that create a zip file containing both the completed exercise and the generated PDF. You are **required** to check the PDF to make sure all the code **and** outputs are clearly visible and easy to read. If your code goes off the page, you should reduce the line size. I generally recommend not going over 80 characters.\n", "\n", "Finally, name the zip file using a combination of your the assigment and your name, e.g., ps1_rios.zip" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Evaluation Code\n", "from collections import Counter\n", "\n", "class Eval:\n", " def __init__(self, gold, pred):\n", " assert len(gold)==len(pred)\n", " self.gold = gold\n", " self.pred = pred\n", "\n", " def accuracy(self):\n", " numer = sum(1 for p,g in zip(self.pred,self.gold) if p==g)\n", " return numer / len(self.gold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 1\n", "\n", "a) How many documents are there for each language in the training set? the dev set? (Make a table.) 
[3 pts]\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# This function loads and tokenizes the text for you.\n", "def load_docs(direc, lemmatize, labelMapFile='labels.csv'):\n", " \"\"\"Return a list of word-token-lists, one per document.\n", " Words are optionally lemmatized with WordNet.\"\"\"\n", "\n", "\n", " labelMap = {} # docID => gold label, loaded from mapping file\n", " with open(os.path.join(direc, labelMapFile)) as inF:\n", " for ln in inF:\n", " docid, label = ln.strip().split(',')\n", " assert docid not in labelMap\n", " labelMap[docid] = label\n", "\n", " # create parallel lists of documents and labels\n", " docs = []\n", " labels = []\n", " for file_path in glob.glob(os.path.join(direc, '*.txt')):\n", " filename = os.path.basename(file_path)\n", " with open(file_path) as iFile:\n", " docs.append(iFile.read().strip().split())\n", " labels.append(labelMap[filename.split('.')[0]])\n", " \n", " return docs, labels" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "documents, labels = load_docs('./data/train/', False, 'labels.csv')\n", "documents, labels = load_docs('./data/dev/', False, 'labels.csv')\n", "# WRITE CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "b) What would be the majority class baseline accuracy on the dev set, i.e., what is the accuracy if you always predict the most frequent class? [3 pts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 2\n", "\n", "Implement the perceptron algorithm without averaging or early stopping.\n", "\n", "The maximum number of iterations to run is specified as a method parameter. 
(However, do stop before the specified number of iterations if the training data are completely separated, i.e., an iteration proceeds without any errors/updates.) As a baseline featureset, implement bias features and (binary) unigram features. \n", "\n", "Run training for up to 30 iterations, tracking train and dev set accuracy after each iteration. I suggest printing to stderr. With my implementation, the first 5 lines printed are:\n", "\n", "```\n", "5366 training docs with 154.68747670518076 percepts on avg\n", "598 dev docs with 154.81939799331104 percepts on avg\n", "604 test docs with 153.6341059602649 percepts on avg\n", "iteration: 0 updates=2980, trainAcc=0.4446515095042862, devAcc=0.560200668896321, params=110837\n", "iteration: 1 updates=1176, trainAcc=0.7808423406634365, devAcc=0.540133779264214, params=127017\n", "```\n", "\n", "(You don't necessarily need to reproduce this output exactly; it is just for illustration.)\n", "\n", "[code: 18 pts]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def extract_feats(doc):\n", " \"\"\"\n", " Extract input features (percepts) for a given document.\n", " Each percept is a pairing of a name and a boolean, integer, or float value.\n", " A document's percepts are the same regardless of the label considered.\n", " :doc list: list of strings, e.g., ['the', 'fat', 'cat']\n", " :return Counter: Counter of word-count (key-value) pairs.\n", " \"\"\"\n", " ff = Counter()\n", " # Write code to transform doc into Count BoW features\n",
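The starter cell above ends at the point where the features are to be filled in. A minimal sketch of one possible completion, assuming binary unigram percepts plus a single always-on bias percept as Problem 2 describes (the feature name `'**BIAS**'` and the helper name `extract_feats_sketch` are my own illustrative choices, not part of the starter interface):

```python
from collections import Counter

def extract_feats_sketch(doc):
    """Illustrative featurizer: binary unigrams plus a bias percept."""
    ff = Counter()
    for word in set(doc):  # set() makes each unigram feature binary
        ff[word] = 1
    ff['**BIAS**'] = 1     # bias percept fires for every document
    return ff

feats = extract_feats_sketch(['the', 'fat', 'cat', 'the'])
```

Note that although the docstring mentions word counts, Problem 2 asks for *binary* unigram features, so this sketch stores 1 regardless of how many times a word occurs.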
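Problem 2's training loop (one weight vector per class, promote the gold class and demote the wrongly predicted class on each error, stop early when an epoch makes no updates) could be sketched as follows. This is a hedged illustration on made-up toy data; the names `predict` and `train_sketch` are my own and are not the assignment's required interface:

```python
from collections import Counter, defaultdict

def predict(weights, feats, classes):
    """Return the class whose weight vector scores feats highest."""
    scores = {c: sum(weights[c][f] * v for f, v in feats.items())
              for c in classes}
    return max(classes, key=lambda c: scores[c])

def train_sketch(docs_feats, golds, classes, max_iters=30):
    """Plain multiclass perceptron: no averaging, stop if an epoch is error-free."""
    weights = {c: defaultdict(float) for c in classes}
    for it in range(max_iters):
        updates = 0
        for feats, gold in zip(docs_feats, golds):
            pred = predict(weights, feats, classes)
            if pred != gold:
                updates += 1
                for f, v in feats.items():
                    weights[gold][f] += v  # promote the gold class
                    weights[pred][f] -= v  # demote the predicted class
        if updates == 0:  # training data separated: stop early
            break
    return weights

# Tiny linearly separable toy set: one "French" doc, one "German" doc
toy = [Counter({'bonjour': 1, '**BIAS**': 1}),
       Counter({'hallo': 1, '**BIAS**': 1})]
gold = ['FRA', 'DEU']
w = train_sketch(toy, gold, ['FRA', 'DEU'])
```

On separable data like this toy set, the loop terminates as soon as a full pass over the training docs produces zero updates, well before `max_iters`.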
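Returning to Problem 1, a `collections.Counter` over the gold labels gives both the per-language document table (1a) and the majority-class baseline accuracy (1b). The label list below is a toy stand-in, purely for illustration; the real labels come from `load_docs`:

```python
from collections import Counter

# Toy labels standing in for the list returned by load_docs
dev_labels = ['DEU', 'FRA', 'DEU', 'ITA', 'DEU', 'FRA']

counts = Counter(dev_labels)  # documents per language (1a)
majority_label, majority_count = counts.most_common(1)[0]
baseline_acc = majority_count / len(dev_labels)  # always predict majority (1b)
```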