COMP9319 2019T2 Assignment 2: RLFM Index (Run-Length Encoded BWT)
https://www.cse.unsw.edu.au/~wong/cs9319-2019a2.html (7/6/2019)

Write it in C. I need someone who is very good at web data compression to work on this. You can use .\ instead of |. The code should work on both a Linux machine and our CSE machine, so it is better if the expert codes on Linux rather than Ubuntu.





Your task in this assignment is to create a search program that implements BWT backward search, which can efficiently search an RLFM encoded record file. The original file (before RLFM) format is:

    [<id1>]<text1>[<id2>]<text2>[<id3>]<text3>... ...

where <id1>, <id2>, <id3>, etc. are integer values that are used as unique record identifiers; and <text1>, <text2>, <text3>, etc. are record values (text), which include any ASCII characters with ASCII values from 32 to 126, tab (ASCII 9) and newline (ASCII 10 and 13). For simplicity, there will be no open or close square bracket in the record values.

Your C/C++ program, called rlebwt, accepts:

1. a command argument of either:
   -m for the number of matching substrings (count duplicates),
   -r for the number of unique matching records,
   -a for listing the identifiers of all the matching records (no duplicates and in ascending order), or
   -n for displaying the record value of a given record identifier;
2. the path to an RLFM encoded file (without its file extension);
3. the path to an index folder; and
4. a quoted query string (i.e., the search term) for option -m, -r, -a, or a quoted record identifier for option -n

as commandline input arguments. The search term can be up to 512 characters. To make the assignment easier, we assume that the search is case sensitive.

If -a is specified, using the given query string, rlebwt will perform backward search on the given RLFM encoded file, and output the sorted and unique identifiers (no duplicates) of all the records that contain the input query string to the standard output. Each identifier is enclosed in a pair of square brackets, one line (ending with a '\n') for each match.

If -m is specified, given a query string, rlebwt will output the total number of matching substrings (count duplicates) to the standard output. The output is the total number, with an ending newline character.
Similarly, rlebwt will output the total number of unique matching records (do not count duplicates) if -r is specified.

If -n is specified, using the given record identifier, rlebwt will output the original record value (text) to the standard output with a '\n' at the end.

For any of the above options, if a match cannot be found, simply output nothing.

Although you do not need to submit a BWT encoder, it is a part of this assignment that you will implement a simple BWT encoding program based on RLFM (this will help you in understanding the lecture materials and assist in testing your assignment).

File Extensions and Formats

Sample files are provided in ~cs9319/a2/.

    wagner % pwd
    /import/kamen/1/cs9319/a2
    wagner % ls -l
    total 12088
    -rw-r--r-- 1 cs9319 cs9319  911184 Jun 27 23:12 dblp.b
    -rw-r--r-- 1 cs9319 cs9319  911184 Jun 27 23:12 dblp.bb
    -rw-r--r-- 1 cs9319 cs9319 3132682 Jun 27 23:12 dblp.s
    -r--r--r-- 1 cs9319 cs9319 7289468 Jun 27 23:12 dblp.txt
    -rw-r--r-- 1 cs9319 cs9319    3191 Jun 27 22:40 shopping.b
    -rw-r--r-- 1 cs9319 cs9319    3191 Jun 27 22:40 shopping.bb
    -rw-r--r-- 1 cs9319 cs9319   13700 Jun 27 22:40 shopping.s
    -r--r--r-- 1 cs9319 cs9319   25525 Jun 27 23:15 shopping.txt
    -rw-r--r-- 1 cs9319 cs9319       2 Jun 27 23:11 simple1.b
    -rw-r--r-- 1 cs9319 cs9319       2 Jun 27 23:11 simple1.bb
    -rw-r--r-- 1 cs9319 cs9319      11 Jun 27 23:11 simple1.s
    -r--r--r-- 1 cs9319 cs9319      15 Jun 27 23:11 simple1.txt
    -rw-r--r-- 1 cs9319 cs9319       8 Jun 27 23:11 simple2.b
    -rw-r--r-- 1 cs9319 cs9319       8 Jun 27 23:11 simple2.bb
    -rw-r--r-- 1 cs9319 cs9319      35 Jun 27 23:11 simple2.s
    -r--r--r-- 1 cs9319 cs9319      58 Jun 27 23:11 simple2.txt
    -rw-r--r-- 1 cs9319 cs9319      70 Jun 27 23:11 simple3.b
    -rw-r--r-- 1 cs9319 cs9319      70 Jun 27 23:11 simple3.bb
    -rw-r--r-- 1 cs9319 cs9319     378 Jun 27 23:11 simple3.s
    -r--r--r-- 1 cs9319 cs9319     553 Jun 27 23:11 simple3.txt
    wagner %

The file extensions represent their corresponding types:
FILENAME.txt - the original text file. It is provided for your reference only. It will not be available during auto marking.
FILENAME.s - corresponds to S in the RLFM lecture slides and its original paper. It is the BWT text with the consecutive duplicates removed.
FILENAME.b - corresponds to the bit array B in the RLFM lecture slides and its original paper. It is in binary format, which can be inspected using xxd as shown later.
FILENAME.bb - corresponds to the bit array B' in the RLFM lecture slides and its original paper. It is in binary format, which can be inspected using xxd as shown later. This file is not provided during auto marking. Your rlebwt will need to generate it.

For the B and B' arrays, after the last bit is written to the file, fill in the gap (if any) of the last byte with bit 1. Check the xxd examples below for details.

Initialization and External Files

Whenever rlebwt is executed using a given file FILENAME, for example:

    rlebwt -X FILENAME INDEX_FOLDER QUERY_STRING

where X can be any one of the options (-m, -r, -a, -n), it will take FILENAME.s and FILENAME.b as input, and also check if FILENAME.bb exists. If FILENAME.bb does not exist, it will generate one. After that, it will check if INDEX_FOLDER exists. If not, it will create it as an index folder. Index files will then be generated inside this index folder accordingly.

In addition to the B' array, your solution is allowed to write out up to 6 external index files that are in total no larger than the total size of the given input FILENAME.s file plus 2 x the size of the given FILENAME.b. If your index files are larger than this limit, you will receive zero points for the tests that involve that given FILENAME.

You may assume that the index folder (and the index files inside it) will not be deleted during all the tests for a given FILENAME, and that all the INDEX_FOLDER names are unique and correspond to their FILENAME.
Therefore, to save time, you only need to generate the index files when their folder does not exist yet.

Example

Suppose the original file (say dummy.txt) before RLFM is:

    [3]Computers in industry[25]Data compression[33]Integration[40]Big data indexing[90]1990-02-19[190]20.55

Some examples:

    %wagner> rlebwt -m ~/a2/dummy ~/a2/dummyIndex "in"
    4
    %wagner> rlebwt -r ~/a2/dummy ~/a2/dummyIndex "in"
    2
    %wagner> rlebwt -a ~/a2/dummy ~/a2/dummyIndex "in"
    [3]
    [40]
    %wagner> rlebwt -n ~/a2/dummy ~/a2/dummyIndex "3"
    Computers in industry
    %wagner>

In the above example, we assume dummy.s and dummy.b exist in the a2 folder of our home directory. rlebwt will generate dummy.bb inside a2, and will then create an index folder called dummyIndex (with the index files inside dummyIndex) inside a2 as well.

In the following example, we assume dummy.s and dummy.b exist in the XYZ folder of the account MyAccount. You will check if dummy.bb exists in ~MyAccount/XYZ/. If not, your submitted rlebwt will generate dummy.bb in ~MyAccount/XYZ/ (assume you have write permission in that folder). You will create an index folder called dummy (with the index files inside dummy) in your current directory.

    %wagner> rlebwt -m ~MyAccount/XYZ/dummy dummy "in "
    1
    %wagner> rlebwt -r ~MyAccount/XYZ/dummy dummy "in "
    1
    %wagner>
    %wagner> rlebwt -a ~MyAccount/XYZ/dummy dummy "In"
    [33]
    %wagner>
    %wagner> rlebwt -m ~MyAccount/XYZ/dummy dummy "9"
    3
    %wagner> rlebwt -r ~MyAccount/XYZ/dummy dummy "9"
    1
    %wagner> rlebwt -a ~MyAccount/XYZ/dummy dummy "9"
    [90]
    %wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "9"
    %wagner>
    %wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "90"
    1990-02-19
    %wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "25"
    Data compression
    %wagner>

Note that it is possible that your submission may be tested with the B' files provided.
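The initialization checks described above (reuse FILENAME.bb when it already exists, and build the index only when INDEX_FOLDER is missing) might be sketched as follows. This is a minimal sketch rather than a required design: it assumes a POSIX system, and the helper names path_exists, bb_missing and ensure_index_folder are hypothetical.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns 1 if something (file or folder) exists at path, 0 otherwise. */
static int path_exists(const char *path) {
    struct stat st;
    return stat(path, &st) == 0;
}

/* Returns 1 if <base>.bb still has to be generated, 0 if it is already
   there, and -1 if the resulting path would be too long for the buffer. */
static int bb_missing(const char *base) {
    char path[4096];
    int n = snprintf(path, sizeof path, "%s.bb", base);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    return !path_exists(path);
}

/* Returns 1 if the folder was just created (index files must be built),
   0 if it already existed (index files can be reused), -1 on error. */
static int ensure_index_folder(const char *folder) {
    if (path_exists(folder))
        return 0;
    return mkdir(folder, 0755) == 0 ? 1 : -1;
}
```

A caller would generate the B' file only when bb_missing returns 1, and build the index files only when ensure_index_folder returns 1.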
For example, the RLFM encoded file path could be ~cs9319/a2/simple1 and the path to the index folder could be ~/a2/myIndex. Since simple1.bb is already there, you do not need to generate the B' file again and just read and use it from ~cs9319/a2/. You will then generate the index folder called myIndex in your own a2 folder.

Inspecting the Binary Files

You may find the tool xxd useful to inspect the binary files that correspond to the B and B' arrays. For example, you may use xxd to inspect the provided sample files:

    wagner % pwd
    /import/kamen/1/cs9319/a2
    wagner %
    wagner % xxd -b simple1.b
    0000000: 10111111 11101001                                      ..
    wagner % xxd -b simple1.bb
    0000000: 11101011 00111111                                      .?
    wagner % cat simple1.s
    [an12nbnb]awagner %
    wagner % xxd -b simple2.b
    0000000: 10000110 10111111 11111111 11111001 00000010 00111100  .....<
    0000006: 11100110 10111111                                      ..
    wagner % xxd -b simple2.bb
    0000000: 11011111 11100001 01000000 11100101 10010011 11111111  ..@...
    0000006: 01111100 01111111                                      |.
    wagner % cat simple2.s
    [1[1endgnad1234245ndbnb]ngnabdiaiaiwagner %
    wagner %

In particular, simple1.s has 11 characters. Therefore, there will be 11 ones in the array B that correspond to the 11 characters in simple1.s.
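The relationship between S and B can be checked with a short decoder. The following is a minimal sketch and not part of the handout: it assumes that bits are stored most-significant-bit first within each byte (the order in which xxd -b prints them), and the names get_bit and expand_bwt are hypothetical. It expands S and B back into the run-length-decoded BWT text, and stops once every character of S has been used so that filler bits in the last byte are ignored.

```c
#include <stddef.h>

/* Bit i of a packed bit array, assuming MSB-first order inside each byte. */
static int get_bit(const unsigned char *buf, size_t i) {
    return (buf[i / 8] >> (7 - i % 8)) & 1;
}

/* Expand S (n characters) and B (nbytes bytes) into the BWT text.
   A 1 bit starts the next run (next character of S); a 0 bit repeats
   the current character. out must have room for nbytes * 8 characters.
   Returns the length of the BWT text, or 0 if B does not describe
   exactly n runs. */
static size_t expand_bwt(const unsigned char *s, size_t n,
                         const unsigned char *b, size_t nbytes,
                         unsigned char *out) {
    size_t nbits = nbytes * 8, runs = 0, len = 0, i;
    for (i = 0; i < nbits; i++) {
        if (get_bit(b, i)) {
            if (runs == n)
                break;          /* all of S is used: the rest is filler */
            runs++;
        } else if (runs == 0) {
            return 0;           /* malformed: B must start with a 1 */
        }
        out[len++] = s[runs - 1];
    }
    return runs == n ? len : 0;
}
```

With the simple1 sample bytes shown above (S = "[an12nbnb]a", B = 0xBF 0xE9), this sketch yields the 15-character text "[[an12nbnb]]aaa", which agrees with the 15-byte size of simple1.txt.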
Since all the zeros represent the duplicates of these 11 characters, you can observe that the last ...
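Generating FILENAME.bb from FILENAME.s and FILENAME.b could be sketched as below. This is one possible approach under stated assumptions, not the reference solution: it assumes MSB-first bit order within each byte, that runs are ordered by the byte value of their character with runs of the same character kept in S order, and that B' occupies the same number of bytes as B (as the sample file sizes suggest). The names get_bit, set_bit and build_bb are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Bit i of a packed bit array, assuming MSB-first order inside each byte. */
static int get_bit(const unsigned char *buf, size_t i) {
    return (buf[i / 8] >> (7 - i % 8)) & 1;
}

/* Set bit i of a packed bit array (same MSB-first assumption). */
static void set_bit(unsigned char *buf, size_t i) {
    buf[i / 8] |= (unsigned char)(0x80u >> (i % 8));
}

/* Build B' (bb, nbytes bytes) from S (n characters) and B (nbytes bytes).
   Returns 0 on success, -1 if B does not describe exactly n runs or
   memory cannot be allocated. */
static int build_bb(const unsigned char *s, size_t n,
                    const unsigned char *b, size_t nbytes,
                    unsigned char *bb) {
    size_t nbits = nbytes * 8, k = 0, i, total;
    size_t pos[257] = {0};
    size_t *len = calloc(n ? n : 1, sizeof *len);
    if (len == NULL)
        return -1;

    /* 1. Decode the run lengths: a 1 starts a run, a 0 extends it.
          A 1 seen after n runs is filler and ends the scan. */
    for (i = 0; i < nbits; i++) {
        if (get_bit(b, i)) {
            if (k == n)
                break;
            len[k++] = 1;
        } else if (k > 0) {
            len[k - 1]++;
        }
    }
    if (k != n || (n > 0 && !get_bit(b, 0))) {
        free(len);
        return -1;
    }

    /* 2. pos[c] = number of text positions holding a character smaller
          than c, i.e. where the runs of c begin in the sorted column. */
    for (k = 0; k < n; k++)
        pos[s[k] + 1] += len[k];
    for (i = 1; i <= 256; i++)
        pos[i] += pos[i - 1];
    total = pos[256];

    /* 3. Mark the first position of every run, grouped by character. */
    memset(bb, 0, nbytes);
    for (k = 0; k < n; k++) {
        set_bit(bb, pos[s[k]]);
        pos[s[k]] += len[k];
    }

    /* 4. Fill the gap in the last byte with 1 bits. */
    for (i = total; i < nbits; i++)
        set_bit(bb, i);

    free(len);
    return 0;
}
```

Applied to the simple1 and simple2 sample bytes listed in the xxd output above, this sketch reproduces the B' bytes shown there (0xEB 0x3F for simple1), which makes those samples a convenient regression check.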
Jul 06, 2021 COMP9319