


Assignment Goals


The goal of this assignment is to gather new datasets and to develop data collection and data processing tools for collecting data from new sources, such as social media sites, websites, and other online sources.




The data gathered will then be processed and stored using a variety of tools and approaches covered in the course material and other resources, so that further data analytics can be applied at a later point.






Data Storage and Processing


Students are encouraged to leverage multiple data sources, such as files, databases (relational/NoSQL/graph), and online sources such as web-scraped data and data accessed via APIs or Open Data. Marks will be awarded for the variety and complexity of the data sources used. Data augmentation, to enrich or improve the quality of the data for future analysis, is encouraged.
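
For example, a minimal data-collection step in this spirit might pull JSON from a public API and land it in a local database. The sketch below uses the public Reddit listing endpoint; the subreddit, table layout, and User-Agent string are illustrative assumptions, not requirements of the brief.

    """Sketch: pull JSON from a web API and land it in SQLite.

    The subreddit, table layout, and User-Agent are illustrative choices only.
    Requires Python 3.9+ and the `requests` package.
    """
    import json
    import sqlite3

    import requests

    URL = "https://www.reddit.com/r/dataengineering/new.json?limit=25"  # assumed source
    HEADERS = {"User-Agent": "course-assignment-collector/0.1"}  # Reddit expects a custom UA


    def fetch_posts(url: str) -> list[dict]:
        """Download the listing and return the raw post dictionaries."""
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return [child["data"] for child in resp.json()["data"]["children"]]


    def store_posts(posts: list[dict], db_path: str = "raw_posts.db") -> None:
        """Keep the full JSON payload alongside a few queryable columns."""
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(id TEXT PRIMARY KEY, created_utc REAL, title TEXT, raw_json TEXT)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
            [(p["id"], p["created_utc"], p["title"], json.dumps(p)) for p in posts],
        )
        con.commit()
        con.close()


    if __name__ == "__main__":
        store_posts(fetch_posts(URL))

Keeping the raw JSON next to the extracted columns is one way to support later augmentation or re-processing without re-collecting the data.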





Data Size and Complexity


The assignment will require the use of scalable approaches, such as processing with Spark and file storage formats such as HDFS/Parquet/ORC; however, the size of the dataset and the complexity of any model training will not be an important factor for grading. Students do not need to choose a large dataset that requires a great deal of compute power and many processing nodes; it is only necessary to use an approach that could scale if the dataset were larger.
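
As a sketch of what "scalable but small" can look like, the following Spark job reads raw JSON, de-duplicates it, and writes partitioned Parquet. The paths and the `created_utc` column are assumptions for illustration; the same code runs unchanged on a laptop or on a multi-node cluster.

    """Sketch: a scalable processing step with PySpark and Parquet.
    Input/output paths and column names are illustrative assumptions."""
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("assignment-ingest").getOrCreate()

    raw = spark.read.json("data/raw/*.json")  # assumed landing area for collected JSON

    cleaned = (
        raw.dropDuplicates(["id"])  # assumed unique key from the source API
           .withColumn("event_date", F.to_date(F.from_unixtime("created_utc")))
    )

    # Partitioning by date keeps files small and lets later analysis prune I/O.
    cleaned.write.mode("overwrite").partitionBy("event_date").parquet("data/curated/posts")

    spark.stop()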





Marking considerations:


— Using different sources (e.g. Reddit API, Twitter API, Yahoo Finance – stock prices and news, Competitor Analysis – https://craft.co/, Finance News – https://www.wsj.com/)


— Suitable choice of databases (SQL/NoSQL/Graph)


— Data gathered from online sources (scraping websites or extracting data from APIs)


— Combining and merging data from multiple sources


— Use of a machine learning pipeline for training, deployment, and serving via an API (a minimal training-and-serving sketch follows this list)


— Good use of data storage formats – e.g. Avro/Parquet etc.


— Appropriate use of relevant theoretical models and recognised industry best practices, concepts, and frameworks


— Using a pipeline capable of being distributed to improve scalability, such as Apache Spark, together with suitable data storage formats


— Using or defining appropriate Data Schemas and Storage approaches to enable a wide range of data analysis to be applied


— Appropriate approaches to environment repeatability, such as source control with Git and Docker containers


— Use of automation scripts for automating aspects of the solution, including shell scripts, Python scripts, and cloud deployment approaches such as Terraform or alternatives


— Use of data lineage for tracking data origins and any transformations applied (Marquez/W3C PROV/Apache Atlas or others; a lightweight lineage-record sketch follows this list)


— The usability of the approach for the capture and extraction of datasets for use by others


— Applying simple machine learning to the data
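
For the machine-learning pipeline criterion, one possible shape is to fit a scikit-learn pipeline on the curated data, persist it, and serve predictions from a small Flask endpoint. The column names, file paths, and /predict route below are illustrative assumptions, not part of the brief; the same training/serving split carries over to Spark MLlib or other frameworks.

    """Sketch: train a scikit-learn pipeline, persist it, serve it over HTTP.
    Feature/label column names and file paths are illustrative assumptions."""
    import joblib
    import pandas as pd
    from flask import Flask, jsonify, request
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    MODEL_PATH = "model.joblib"


    def train(csv_path: str = "data/curated/training.csv") -> None:
        """Fit a simple pipeline on curated data and persist it to disk."""
        df = pd.read_csv(csv_path)
        X, y = df.drop(columns=["label"]), df["label"]  # assumed column layout
        pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
        pipeline.fit(X, y)
        joblib.dump(pipeline, MODEL_PATH)


    app = Flask(__name__)


    @app.route("/predict", methods=["POST"])
    def predict():
        """Score a JSON list of records with the persisted pipeline."""
        model = joblib.load(MODEL_PATH)
        records = pd.DataFrame(request.get_json())
        return jsonify(predictions=model.predict(records).tolist())


    if __name__ == "__main__":
        train()
        app.run(port=5000)

A client would then POST a list of JSON records to http://localhost:5000/predict and receive predictions back.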
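
For the data-lineage criterion, dedicated tooling such as Marquez, OpenLineage, or Apache Atlas is the intended route. The sketch below is only a hand-rolled, PROV-inspired stand-in that records, for each run, which inputs and which transformation produced which output; all paths and job names are illustrative assumptions.

    """Sketch: a minimal lineage record in the spirit of W3C PROV.
    One JSON document per pipeline run, listing inputs, activity, and output."""
    import json
    from datetime import datetime, timezone
    from pathlib import Path


    def record_lineage(inputs: list[str], activity: str, output: str,
                       log_dir: str = "lineage") -> Path:
        """Append a run-level lineage entry describing how `output` was derived."""
        generated_at = datetime.now(timezone.utc).isoformat()
        entry = {
            "entity": output,              # dataset that was produced
            "was_generated_by": activity,  # transformation / job name
            "used": inputs,                # upstream datasets (origins)
            "generated_at": generated_at,
        }
        Path(log_dir).mkdir(exist_ok=True)
        out_file = Path(log_dir) / f"{activity}-{generated_at.replace(':', '-')}.json"
        out_file.write_text(json.dumps(entry, indent=2))
        return out_file


    if __name__ == "__main__":
        record_lineage(
            inputs=["https://www.reddit.com/r/dataengineering/new.json",
                    "data/raw/posts.json"],
            activity="clean-and-partition",
            output="data/curated/posts",
        )

Calling this at the end of each collection or processing step leaves a simple, queryable trail of where every curated dataset came from.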

