CPS 470/570: Computer Networks
Assignment #1, Part 1: Notes on Grading

1. Part 1 (25 points): Download one webpage, one week

1.1. Sample runs

Your program must accept a single command-line argument with a target URL and request that one web page, e.g.:

        hw1.exe http://www.yahoo.com

Another example:

        hw1.exe http://128.194.135.11?viewcart.php/

1.2. Grading chart

CPS 470/570: Computer Networks and Security
Programming Assignment #1, 100 pts

- At most four students in a team; one submission per team.
- No late submissions will be accepted.
- Receive 5 bonus points if you turn in the complete work, without errors, at least one day before the deadline.
- Receive an F for this course if any academic dishonesty occurs.

1. Purpose

This homework builds an understanding of the Application Layer, Winsock programming, and multithreaded programming.

2. Description

Your project will read multiple URLs from an input file using a single thread. To ensure politeness, you will need to hit only unique IPs. To avoid hanging the code on slow downloads, you will also have to abort all pages that take longer than 10 seconds or occupy more than 2 MB.

You may lose points for copy-pasting the same function (with minor changes) over and over, for writing poorly designed or convoluted code, for not checking for errors in every API you call, and for allowing buffer overflows, access violations, debug-assertion failures, heap corruption, synchronization bugs, memory leaks, or any other conditions that lead to a crash. Furthermore, your program must be robust against unexpected responses from the Internet and must avoid deadlocks.

2.1. Single-Threaded Crawling with URL-input-100.txt

The program must accept two arguments. The first indicates the number of threads to run and the second the input file:

        as1.exe 1 URL-input-100.txt

If the number of threads does not equal one, you should reject the parameters and report usage information to the user.
Similarly, if the file does not exist or cannot be read, the program should complain and quit. Assuming these checks pass, you should load the file into RAM and split it into individual URLs (one URL per line). You can use fopen, fgets, and fclose (or their ifstream equivalents) to scan the file one line at a time. A faster approach is to load the entire input into a buffer and then separately determine where each line ends; use C-style fread, or the even faster ReadFile, for this.

Make sure that only unique hosts make it to gethostbyname and that only unique IPs make it to the request phase:

        Parse URL → Check host is unique → DNS lookup (to get IP address) → Check IP is unique → Request robots.txt (header only) → Check HTTP code → Request page (entire file) → Check HTTP code → Parse page

Note that robots.txt existence should be verified using a HEAD request, which ensures that you receive only a header rather than an entire file. That is, in a HEAD request your program asks for the "/robots.txt" file. In the server's reply, code 200 indicates that robots.txt does exist and that the website does not allow unrestricted crawling, which means you should NOT request the web page from this server. For more information, see: http://www.robotstxt.org/orig.html

Your printouts should begin with an indication that you read the file and its size, followed by the trace shown below. Uniqueness-verification steps and the robots phase are highlighted in bold. You should have a function that connects to a server, downloads a given URL, and verifies the HTTP header; you can simply call this function twice to produce both robots- and page-related statistics.
The function needs to accept additional parameters that specify: a) the HTTP method (i.e., HEAD for robots.txt, GET for the entire page); b) the valid HTTP codes; c) the maximum download size (i.e., 2 MB for pages, 16 KB for robots.txt); and d) whether to print an asterisk in the output. If any of the steps fails, you should drop the current URL and move on to the next:

        Opened URL-input-100.txt
        URL: http://www.symantec.com/verisign/ssl-certificates
                Parsing URL... host www.symantec.com, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 139 ms, found 104.69.239.70
                Checking IP uniqueness... passed
                Connecting on robots... done in 5 ms
                Loading... done in 57 ms with 213 bytes
                Verifying header... status code 200
        URL: http://www.weatherline.net/
                Parsing URL... host www.weatherline.net, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 70 ms, found 216.139.219.73
                Checking IP uniqueness... passed
                Connecting on robots... done in 11 ms
                Loading... done in 61 ms with 179 bytes
                Verifying header... status code 404
              * Connecting on page... done in 3020 ms
                Loading... done in 87 ms with 10177 bytes
                Verifying header... status code 200
              + Parsing page... done in 0 ms with 16 links
        URL: http://abonnement.lesechos.fr/faq/
                Parsing URL... host abonnement.lesechos.fr, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 1 ms, found 212.95.72.31
                Checking IP uniqueness... passed
                Connecting on robots... done in 138 ms
                Loading... done in 484 ms with 469 bytes
                Verifying header... status code 404
              * Connecting on page... done in 4335 ms
                Loading... done in 899 ms with 57273 bytes
                Verifying header... status code 200
              + Parsing page... done in 1 ms with 63 links
        URL: http://allafrica.com/stories/201501021178.html
                Parsing URL... host allafrica.com, port 80
                Checking host uniqueness... failed
        URL: http://architectureandmorality.blogspot.com/
                Parsing URL... host architectureandmorality.blogspot.com, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 19 ms, found 216.58.218.193
                Checking IP uniqueness... failed
        URL: http://aviation.blogactiv.eu/
                Parsing URL... host aviation.blogactiv.eu, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 218 ms, found 178.33.84.148
                Checking IP uniqueness... passed
                Connecting on robots... done in 9118 ms
                Loading... failed with 10060 on recv

In the last example, loading robots.txt fails with error 10060 on recv, so the crawl never reaches a page download with a success code (2xx), which explains why parsing was not performed. As the text may scroll by pretty fast, you can watch for * and + to easily track how often the program attempts to load the target page and parse HTML, respectively.

Basic operation of Winsock is covered in class and in the sample code. Additional caveats are discussed next.

2.2. Required HTTP Fields

The general URL format is given by:

        scheme://[user:pass@]host[:port][/path][?query][#fragment]

There is no need to download a page if the scheme is https, and no need to parse the username/password in this assignment. You should extract the host, port number, path, and query. For instance:

        Given URL http://cs.udayton.edu:467?addrbook.php: host is cs.udayton.edu, port is 467, path is empty, query is ?addrbook.php.
        Given URL http://138.194.135.11?viewcart.php/: host is 138.194.135.11, port is 80, path is /, query is ?viewcart.php.

        URL: http://zjk.focus.cn/
                Parsing URL... host zjk.focus.cn, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 1135 ms, found 101.227.172.52
                Checking IP uniqueness... passed
                Connecting on robots... done in 367 ms
                Loading... done in 767 ms with 140 bytes
                Verifying header... status code 403
              * Connecting on page... done in 3376 ms
                Loading... failed with slow download
        URL: http://azlist.about.com/a.htm
                Parsing URL... host azlist.about.com, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 81 ms, found 207.126.123.20
                Checking IP uniqueness... passed
                Connecting on robots... done in 5 ms
                Loading... failed with exceeding max
        URL: http://apoyanocastigues.mx/
                Parsing URL... host apoyanocastigues.mx, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 57 ms, found 23.23.109.126
                Checking IP uniqueness... passed
                Connecting on robots... done in 49 ms
                Loading... done in 2131 ms with 176 bytes
                Verifying header... status code 404
              * Connecting on page... done in 3051 ms
                Loading... failed with exceeding max
        URL: http://ba.voanews.com/media/video/2563280.html
                Parsing URL... host ba.voanews.com, port 80
                Checking host uniqueness... passed
                Doing DNS... done in 11 ms, found 128.194.178.217
                Checking IP uniqueness... passed
                Connecting on robots... done in 2 ms
                Loading... done in 490 ms with 2436 bytes
                Verifying header... status
May 22, 2021