CS425/591 Individual Project 3: Motif Search on AWS

I am not sure how to write the code for this.


Problem: Implement the median-string-based motif search algorithm (as explained in class, see also the attached slides) using MapReduce on AWS. You can use either your self-installed Hadoop platform on AWS or the readily available EMR on AWS.

Input: Motif length l = 8 and an attached sequence data file named “promoters_data_clean.txt”, which consists of 106 sequences (each on a separate line).

Output: The output of your program must include the following items (each in a separate column):
• the motif, i.e., the found consensus string (also called the median string), which is the candidate with the minimum total matching distance (this is the same for all input sequences),
• the best match of the motif found in each input sequence,
• the sequence’s id (i.e., the line number of the sequence in the input file),
• the local matching distance (i.e., the distance between the motif and the best match found in each sequence),
• the position index of the best match found in each input sequence (note: the index starts from position 1),
• the minimum total matching distance (this is the same for all input sequences).

A sample output is shown below:

Hints: The most vital part of applying the MapReduce framework to real-world problems is identifying what the keys and values are. While there are more advanced approaches, the following hint is a naïve method meant to inspire your creativity (don’t confine your creativity to the frame of this naïve method!). You can use the candidate median strings (65536 in total) as the keys, and the total matching distances of the respective candidates as the values. That means you will not get the keys from the input but will generate them (i.e., enumerate the candidate median strings) on the fly in your code. Your Map function outputs each median string paired with its total matching distance; your Reduce function reverses each key/value pair, so that the total matching distance becomes the key and the median string the value.
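The first round described in the hint can be sketched locally in plain Python before moving it to Hadoop or EMR. This is only an illustration under assumptions, not the course's reference solution: the function names (`hamming`, `best_match`, `total_distance`, `map_candidates`, `reduce_reverse`) are hypothetical, the matching distance is assumed to be the Hamming distance, and the sequences are assumed to use the uppercase alphabet A, C, G, T (which gives 4^8 = 65536 candidates for l = 8, matching the count in the hint). The toy run uses l = 3 to keep it fast.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: uppercase DNA letters


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(candidate, sequence):
    """Return (distance, 1-based position, window) of the closest window.

    Ties are broken in favour of the leftmost window.
    """
    l = len(candidate)
    best = None
    for i in range(len(sequence) - l + 1):
        window = sequence[i:i + l]
        d = hamming(candidate, window)
        if best is None or d < best[0]:
            best = (d, i + 1, window)
    return best


def total_distance(candidate, sequences):
    """Sum of the local best-match distances over all sequences."""
    return sum(best_match(candidate, s)[0] for s in sequences)


def map_candidates(sequences, l):
    """Map step: generate every candidate and emit (candidate, total distance)."""
    for letters in product(ALPHABET, repeat=l):
        candidate = "".join(letters)
        yield candidate, total_distance(candidate, sequences)


def reduce_reverse(pairs):
    """Reduce step: swap each pair to (total distance, candidate) and sort."""
    return sorted((dist, cand) for cand, dist in pairs)


# Toy data standing in for the real input file.
toy_sequences = ["ACGTACGT", "TTACGTTT", "GGGACGAA"]
ranked = reduce_reverse(map_candidates(toy_sequences, 3))
print(ranked[0])  # (0, 'ACG')
```

In a real job, one option is to split the candidate keys across mappers rather than the sequences, since every candidate needs to see all 106 sequences to compute its total distance.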
The output of Reduce will be a sorted list of the reversed pairs; the first pair holds the minimum total matching distance and the motif you have found. You can then start a second MapReduce round to produce the required final output as specified above.

Submission Requirement: Your submission must include the following:
(1) The required output as described above (in a separate text file).
(2) The source code of your programs (only the code that you wrote) with necessary comments.
(3) A short project report including a brief description of your program structure, with discussion/comments (short or long) on any issues you feel are worth mentioning. Please address the following:
• Briefly describe your resource configuration (e.g., number of compute nodes) and report the execution time of your program.
• In addition to the default arrangement of MapReduce, you are encouraged to try alternative task/data splitting schemes. What is a better/best splitting scheme for this project?
• With regard to scalability, if the sequence dataset increases by 10 times, will your program take 10 times longer, and why?
(4) A short video demo (no more than 5 minutes).

Please submit all required files to the project folder at mycourses.siu.edu.
Mar 28, 2022