Please read the assignment carefully. The solution must be written in Python, and each line of code must be commented.


Project Aim Statement:

Goal - My project, aimed at a long-term goal in artificial intelligence, is to build a "multimodal" neural network, which belongs to one of the research areas in AI. A multimodal AI system learns concepts in multiple modalities, primarily text and images, in order to better understand the world.

Objective - To develop hypotheses and working concepts to achieve multimodal inference for scenarios where questions about images or videos are input as text and the model can adequately answer them. The hypothesis will be based on Visual Question Answering (VQA). VQA is a task that combines computer vision, natural language processing, and deep learning: questions about visual (image/video) content are asked freely in natural language. Answering these questions requires a wide range of skills, including proper localization and recognition of objects and people, recognition of their activities, and common sense.

Task - Given an image, a visual question-answering algorithm allows the machine to answer free-form, open-ended, natural-language questions about the image.

Approach towards Task:

Baseline Model - Bag of words (BOW) is a natural language processing (NLP) strategy for converting a text document into numbers that can be used by a computer program, so a BOW Q model is a reasonable choice of baseline.
§ BOW Q (Bag of words)

Visual Features - Convolutional Neural Networks (CNNs), commonly used for image classification tasks, have been chosen for visual representation:
§ ResNet
§ Inception

Language Features - A Recurrent Neural Network (RNN) is a widely used deep learning architecture for modeling sequential information.
§ LSTM Q
§ Bi-LSTM Q

Fusion Model - The information from the text and image encoders is fused into a combined representation to perform the downstream task.
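The baseline and fusion steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the assignment's solution: the toy vocabulary, the four-number image vector (a stand-in for pooled CNN features from ResNet or Inception), the layer sizes, and all function names are hypothetical, and the weights are random rather than trained. It uses only the standard library; a real model would use a deep learning framework.

```python
import math  # exp() is needed for the softmax
import random  # used only to create placeholder (untrained) weights

VOCAB = ["what", "color", "is", "the", "cat", "dog"]  # hypothetical toy vocabulary


def bow_encode(question, vocab=VOCAB):
    """Return a bag-of-words count vector for the question."""
    tokens = question.lower().replace("?", "").split()  # naive whitespace tokenizer
    return [float(tokens.count(word)) for word in vocab]  # one count per vocabulary word


def linear(x, weights, bias):
    """Dense layer: each weight row produces one output value."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, v) for v in x]  # negative activations are clipped to zero


def softmax(x):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(x)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in x]  # exponentiate the shifted scores
    total = sum(exps)  # normalising constant
    return [e / total for e in exps]  # divide so the outputs sum to 1


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases (placeholder, untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out  # (weight matrix, bias vector)


def fuse_and_classify(image_feat, question_feat, params):
    """Concatenate both modalities, then score answers with a small MLP."""
    fused = image_feat + question_feat  # list concatenation = combined representation
    hidden = relu(linear(fused, *params["hidden"]))  # hidden layer of the MLP
    return softmax(linear(hidden, *params["out"]))  # probabilities over candidate answers


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image_feat = [0.2, 0.7, 0.1, 0.5]  # assumed stand-in for pooled CNN features
question_feat = bow_encode("What color is the cat?")  # BOW Q representation
params = {
    "hidden": init_layer(len(image_feat) + len(question_feat), 8, rng),  # fused -> 8 units
    "out": init_layer(8, 3, rng),  # 8 units -> 3 hypothetical answer classes
}
probs = fuse_and_classify(image_feat, question_feat, params)  # answer distribution
```

Swapping `bow_encode` for an LSTM or Bi-LSTM encoder would change only how `question_feat` is produced; the concatenation and MLP stay the same.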
Late fusion, or decision-level fusion, can be implemented because it computes the features (for example shape and colour) separately and then concatenates them.
§ Fuse B + MLP

Staggered Aim: As my interest in this project area grew, I planned to implement the approach described above to achieve the goal of the project. However, I would also like to propose Deep Modular Co-Attention Networks (MCAN) for Visual Question Answering as an alternative in case bias issues arise while progressing. To prevent bias, I will give it special consideration and will immediately switch to the alternative plan, or to a combination of other approaches, to make the final decision.
§ The expected outcome is that the multimodal reasoning model outperforms the baseline in terms of accuracy without any bias.
§ The staggered aim relies on a state-of-the-art approach that performed better than previously introduced approaches (e.g. SOTA VQA) and was proposed with better accuracy. This approach is expected to fulfill the requirement of the Objective, as it is deep and dense and built on the 6-layer MCAN with an encoder-decoder strategy.
§ The model will be evaluated based on validation, the optimal performance approach, and accuracy without any bias, to fulfill the requirement of the task.

Data link: - The data can be chosen from any available source, such as Kaggle, Hugging Face, VQA, or COCO.
1. Suitable computer vision task(s) (object detection, object classification, face detection) to create a representation using various facets (multifaceted image representation), along with a question representation, followed by an answer classification network.
2. A co-attention model to generate question attention and image attention; one attention model guides the other. Questions can be encoded at various levels, such as word level, phrase level, or question level.
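The co-attention idea in item 2 can be illustrated with a small sketch. It assumes word-level question features and per-region image features of the same dimension, scores every word against every region with a dot product, and lets each modality's attention depend on the other. This is one simple formulation chosen for illustration, not the MCAN architecture; the feature values and function names are hypothetical.

```python
import math  # exp() is needed for the softmax


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))  # multiply element-wise, then add up


def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]  # exponentiate the shifted scores
    total = sum(exps)  # normalising constant
    return [e / total for e in exps]  # normalised attention weights


def weighted_sum(vectors, weights):
    """Combine vectors into one vector using the attention weights."""
    dims = range(len(vectors[0]))  # iterate over feature dimensions
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in dims]


def co_attention(word_feats, region_feats):
    """Compute question attention and image attention from a shared affinity matrix."""
    # affinity[i][j] says how strongly word i matches image region j
    affinity = [[dot(w, r) for r in region_feats] for w in word_feats]
    question_scores = [max(row) for row in affinity]  # best region match per word
    image_scores = [max(col) for col in zip(*affinity)]  # best word match per region
    question_attn = softmax(question_scores)  # image-guided attention over words
    image_attn = softmax(image_scores)  # question-guided attention over regions
    attended_question = weighted_sum(word_feats, question_attn)  # summary of the question
    attended_image = weighted_sum(region_feats, image_attn)  # summary of the image
    return question_attn, image_attn, attended_question, attended_image


word_feats = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical word embeddings
region_feats = [[2.0, 0.0], [0.0, 0.5], [0.1, 0.1]]  # three hypothetical image regions
question_attn, image_attn, attended_question, attended_image = co_attention(
    word_feats, region_feats
)  # run the sketch on the toy features
```

With these toy values the first word and the first region match most strongly, so each receives the largest attention weight in its modality. Phrase-level or question-level encoding would replace `word_feats` with coarser feature vectors.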
Outcome: - The expected outcome is a working model that takes in a question, as text, about an image or video. It will respond to the question by performing natural language processing and object detection/classification as necessary. An analysis of the model (including accuracy) against the baseline is also expected.

Expected Outcomes | Deliverables:
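The accuracy comparison against the baseline that the Outcome paragraph asks for could be computed along these lines. This is a minimal sketch assuming exact-match accuracy with one ground-truth answer per question; the answers and predictions below are made-up placeholders, not results from any model.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the ground-truth answers."""
    if len(predictions) != len(answers):  # both lists must describe the same questions
        raise ValueError("predictions and answers must have the same length")
    if not answers:  # avoid dividing by zero on an empty evaluation set
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))  # count exact matches
    return correct / len(answers)  # proportion of correct answers


ground_truth = ["red", "two", "yes", "dog"]  # placeholder reference answers
baseline_preds = ["red", "three", "no", "dog"]  # placeholder BOW Q baseline output
fusion_preds = ["red", "two", "no", "dog"]  # placeholder fusion model output

baseline_acc = accuracy(baseline_preds, ground_truth)  # 2 of 4 correct
fusion_acc = accuracy(fusion_preds, ground_truth)  # 3 of 4 correct
improvement = fusion_acc - baseline_acc  # positive when the fusion model is better
```

Running the same function on a held-out validation split for both models gives the baseline comparison directly.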
Answered 1 day after Aug 29, 2022

Aditi answered on Aug 30 2022