
The assignment should be coded in C/C++ and run on a Windows system. Part 2 builds on Part 1; both parts need to be programmed.


CSCE 463/612: Networks and Distributed Processing
Homework 1 Part 2 (25 pts)
Due date: 9/10/19

1. Problem Description

You will now expand into downloading multiple URLs from an input file using a single thread. To ensure politeness, you will hit only unique IPs and check the existence of robots.txt. Robot exclusion is an informal standard that allows webmasters to specify which directories/files on the server are prohibited from download by non-human agents. See http://www.robotstxt.org/ and https://en.wikipedia.org/wiki/Robots_exclusion_standard for more details. To avoid hanging the code on slow downloads, you will also have to abort all pages that take longer than 10 seconds¹ or occupy more than 2 MB (for robots, this limit is 16 KB).

¹ Your previous usage of select constrained each recv to 10 seconds, but allowed unbounded delays across the page. In these cases, a website feeding one byte every 9 seconds could drag forever.

1.1. Code (25 pts)

The program must now accept either one or two arguments. In the former case, it implements the previous functionality; in the latter case, the first argument indicates the number of threads to run and the second one the input file:

    hw1.exe 1 URL-input.txt

If the number of threads does not equal one, you should reject the parameters and report usage information to the user. Similarly, if the file does not exist or cannot be successfully read, the program should complain and quit. Assuming these checks pass, you should load the file into RAM and split it into individual URLs (one line per URL). You can use fopen, fgets, fclose (or their ifstream equivalents) to scan the file one line at a time. A faster approach is to load the entire input into some buffer and then separately determine where each line ends. Use C-style fread or an even-faster ReadFile for this purpose (the sample HTML-parser project shows usage of ReadFile). In the former case, note that the file must be opened in binary mode (e.g., using "rb" in fopen) to avoid unnecessary translation that may corrupt the URLs.
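One possible way to structure the buffer-based approach is sketched below (a minimal sketch only; the helper name loadInputFile and its error handling are assumptions, not part of the assignment):

// Sketch: read the entire input file into memory in binary mode and split it into URLs.
#include <cstdio>
#include <string>
#include <vector>

bool loadInputFile(const char* path, std::vector<std::string>& urls, size_t& fileSize)
{
    FILE* f = fopen(path, "rb");               // binary mode to avoid CRLF translation
    if (f == NULL)
        return false;                          // caller prints an error and quits

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    std::string buf((size_t)size, '\0');
    if (fread(&buf[0], 1, (size_t)size, f) != (size_t)size) {
        fclose(f);
        return false;
    }
    fclose(f);
    fileSize = (size_t)size;

    // one URL per line; tolerate both \r\n and \n endings
    size_t pos = 0;
    while (pos < buf.size()) {
        size_t end = buf.find('\n', pos);
        if (end == std::string::npos)
            end = buf.size();
        std::string line = buf.substr(pos, end - pos);
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.resize(line.size() - 1);
        if (!line.empty())
            urls.push_back(line);
        pos = end + 1;
    }
    return true;
}

The ReadFile route shown in the sample HTML-parser project follows the same pattern: obtain the size, read the whole file with one call, then scan the buffer for line endings.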
To avoid redundant DNS lookups, make sure that only unique hosts make it to gethostbyname. Combining this with the earlier discussion of politeness and robots leads to the following logic:

    Parse URL → Check host is unique → DNS lookup → Check IP is unique → Request robots → Check HTTP code → Request page → Check HTTP code → Parse page

Note that robot existence should be verified using a HEAD request. This ensures that you receive only the header rather than an entire file. Codes 4xx indicate that the robot file does not exist and the website allows unrestricted crawling. Any other code should be interpreted as preventing further contact with that host. Your printouts should begin with an indication that you read the file and its size, followed by the following trace:

Opened URL-input.txt with size 66152005
URL: http://www.symantec.com/verisign/ssl-certificates
        Parsing URL... host www.symantec.com, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 139 ms, found 104.69.239.70
        Checking IP uniqueness... passed
        Connecting on robots... done in 5 ms
        Loading... done in 57 ms with 213 bytes
        Verifying header... status code 200
URL: http://www.weatherline.net/
        Parsing URL... host www.weatherline.net, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 70 ms, found 216.139.219.73
        Checking IP uniqueness... passed
        Connecting on robots... done in 11 ms
        Loading... done in 61 ms with 179 bytes
        Verifying header... status code 404
      * Connecting on page... done in 3020 ms
        Loading... done in 87 ms with 10177 bytes
        Verifying header... status code 200
      + Parsing page... done in 0 ms with 16 links
URL: http://abonnement.lesechos.fr/faq/
        Parsing URL... host abonnement.lesechos.fr, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 1 ms, found 212.95.72.31
        Checking IP uniqueness... passed
        Connecting on robots... done in 138 ms
        Loading... done in 484 ms with 469 bytes
        Verifying header... status code 404
      * Connecting on page... done in 4335 ms
        Loading... done in 899 ms with 57273 bytes
        Verifying header... status code 200
      + Parsing page... done in 1 ms with 63 links

Note that you no longer need to print the request after the port. The uniqueness-verification steps and the robot phase are new. If you already have a function that connects to a server, downloads a given URL, and verifies the HTTP header, you can simply call it twice to produce both robots- and page-related statistics. The function needs to accept additional parameters that specify a) the HTTP method (i.e., HEAD or GET); b) valid HTTP codes (i.e., 2xx for pages, 4xx for robots); c) maximum download size (i.e., 2 MB for pages, 16 KB for robots); and d) presence of an asterisk in the output. If any of the steps fails, you should drop the current URL and move on to the next:

URL: http://allafrica.com/stories/201501021178.html
        Parsing URL... host allafrica.com, port 80
        Checking host uniqueness... failed
URL: http://architectureandmorality.blogspot.com/
        Parsing URL... host architectureandmorality.blogspot.com, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 19 ms, found 216.58.218.193
        Checking IP uniqueness... failed
URL: http://aviation.blogactiv.eu/
        Parsing URL... host aviation.blogactiv.eu, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 218 ms, found 178.33.84.148
        Checking IP uniqueness... passed
        Connecting on robots... done in 9118 ms
        Loading... failed with 10060 on recv
URL: http://zjk.focus.cn/
        Parsing URL... host zjk.focus.cn, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 1135 ms, found 101.227.172.52
        Checking IP uniqueness... passed
        Connecting on robots... done in 367 ms
        Loading... done in 767 ms with 140 bytes
        Verifying header... status code 403
      * Connecting on page... done in 3376 ms
        Loading... failed with slow download
URL: http://azlist.about.com/a.htm
        Parsing URL... host azlist.about.com, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 81 ms, found 207.126.123.20
        Checking IP uniqueness... passed
        Connecting on robots... done in 5 ms
        Loading... failed with exceeding max
URL: http://apoyanocastigues.mx/
        Parsing URL... host apoyanocastigues.mx, port 80
        Checking host uniqueness... passed
        Doing DNS... done in 57 ms, found 23.23.109.126
        Checking IP uniqueness... passed
        Connecting on robots... done in 49 ms
        Loading... done in 2131 ms with 176 bytes
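The "failed with slow download" and "failed with exceeding max" lines above come from enforcing the overall 10-second budget and the size caps. A minimal sketch of a receive loop that enforces both follows; it assumes a connected Winsock socket on which the HEAD or GET request has already been sent, and the names DownloadStatus and receiveBounded are illustrative, not prescribed by the assignment.

// Sketch: receive an HTTP response under an overall deadline and a size cap
// (maxSize = 2 MB for pages, 16 KB for robots; the whole transfer, not each
// recv, must finish within 10 seconds).
#include <winsock2.h>
#include <windows.h>
#include <string>
#pragma comment(lib, "ws2_32.lib")

enum DownloadStatus { DL_OK, DL_TIMEOUT, DL_TOO_BIG, DL_RECV_ERROR };

DownloadStatus receiveBounded(SOCKET sock, std::string& response, size_t maxSize)
{
    const DWORD budgetMs = 10000;
    const DWORD start = GetTickCount();
    char buf[8192];

    while (true) {
        DWORD elapsed = GetTickCount() - start;
        if (elapsed >= budgetMs)
            return DL_TIMEOUT;                  // reported as "failed with slow download"

        // wait only for whatever remains of the overall budget
        timeval tv;
        tv.tv_sec = (long)((budgetMs - elapsed) / 1000);
        tv.tv_usec = (long)(((budgetMs - elapsed) % 1000) * 1000);

        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        if (select(0, &fds, NULL, NULL, &tv) <= 0)
            return DL_TIMEOUT;                  // nothing arrived before the deadline

        int bytes = recv(sock, buf, (int)sizeof(buf), 0);
        if (bytes == SOCKET_ERROR)
            return DL_RECV_ERROR;               // caller reports WSAGetLastError(), e.g., 10060
        if (bytes == 0)
            return DL_OK;                       // server closed the connection: download complete

        response.append(buf, bytes);
        if (response.size() > maxSize)
            return DL_TOO_BIG;                  // reported as "failed with exceeding max"
    }
}

A wrapper around this loop can then carry the remaining parameters from the list above (HTTP method, acceptable status codes, size cap, and whether to prefix the connect line with an asterisk) and be called once for robots.txt and once for the page.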