


Assignment Goals


The goal of this assignment is to gather new datasets and to develop data collection and data processing tools for collecting data from new sources, such as social media sites, websites, and other online sources.




The data gathered will then be processed and stored using a variety of tools and approaches covered in the course material and other resources, so that further data analytics can be applied at a later point.






Data Storage and Processing


Students are encouraged to leverage multiple data sources, such as files, databases (relational/NoSQL/graph), and online sources such as web-scraped data and data accessed via APIs or Open Data. Marks will be awarded for the variety and complexity of the data sources used. Data augmentation, to enrich or improve the quality of the data for future analysis, is encouraged.
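
For example, a minimal data-collection step in this spirit might pull JSON from a public API and land it in a local database. The sketch below uses the public Reddit listing endpoint; the subreddit, table layout, and User-Agent string are illustrative assumptions, not requirements of the brief.

    """Sketch: pull JSON from a web API and land it in SQLite.

    The subreddit, table layout, and User-Agent are illustrative choices only.
    Requires Python 3.9+ and the `requests` package.
    """
    import json
    import sqlite3

    import requests

    URL = "https://www.reddit.com/r/dataengineering/new.json?limit=25"  # assumed source
    HEADERS = {"User-Agent": "course-assignment-collector/0.1"}  # Reddit expects a custom UA


    def fetch_posts(url: str) -> list[dict]:
        """Download the listing and return the raw post dictionaries."""
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return [child["data"] for child in resp.json()["data"]["children"]]


    def store_posts(posts: list[dict], db_path: str = "raw_posts.db") -> None:
        """Keep the full JSON payload alongside a few queryable columns."""
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(id TEXT PRIMARY KEY, created_utc REAL, title TEXT, raw_json TEXT)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
            [(p["id"], p["created_utc"], p["title"], json.dumps(p)) for p in posts],
        )
        con.commit()
        con.close()


    if __name__ == "__main__":
        store_posts(fetch_posts(URL))

Keeping the raw JSON next to the extracted columns is one way to support later augmentation or re-processing without re-collecting the data.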





Data Size and Complexity


The assignment will require the use of scalable approaches, such as processing with Spark and file storage formats such as HDFS/Parquet/ORC; however, the size of the dataset and the complexity of any model training will not be an important factor for grading. Students do not need to choose a large dataset that requires a great deal of compute power and many processing nodes; it is only necessary to use an approach that could scale if the dataset were larger.
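
As a sketch of what "scalable but small" can look like, the following Spark job reads raw JSON, de-duplicates it, and writes partitioned Parquet. The paths and the `created_utc` column are assumptions for illustration; the same code runs unchanged on a laptop or on a multi-node cluster.

    """Sketch: a scalable processing step with PySpark and Parquet.
    Input/output paths and column names are illustrative assumptions."""
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("assignment-ingest").getOrCreate()

    raw = spark.read.json("data/raw/*.json")  # assumed landing area for collected JSON

    cleaned = (
        raw.dropDuplicates(["id"])  # assumed unique key from the source API
           .withColumn("event_date", F.to_date(F.from_unixtime("created_utc")))
    )

    # Partitioning by date keeps files small and lets later analysis prune I/O.
    cleaned.write.mode("overwrite").partitionBy("event_date").parquet("data/curated/posts")

    spark.stop()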





Marking considerations:


— Using different sources (e.g. Reddit API, Twitter API, Yahoo Finance – stock prices and news, Competitor Analysis – https://craft.co/, Finance News – https://www.wsj.com/)


— Suitable choice of databases (SQL/NoSQL/Graph)


— Data gathered from online sources (scraping websites or extracting data from APIs)


— Combining and merging data from multiple sources


— Use of a machine learning pipeline for training, deployment, and serving via an API (a minimal training-and-serving sketch follows this list)


— Good use of data storage formats – e.g. Avro/Parquet etc.


— Appropriate use of relevant theoretical models and recognised industry best practices, concepts, and frameworks


— Using a pipeline capable of being distributed to improve scalability, such as Apache Spark, together with suitable data storage formats


— Using or defining appropriate Data Schemas and Storage approaches to enable a wide range of data analysis to be applied


— Appropriate approaches to environment repeatability, such as source control with Git and Docker containers


— Use of automation scripts for automating aspects of the solution, including shell scripts, Python scripts, and cloud deployment approaches such as Terraform or alternatives


— Use of data lineage for tracking data origins and any transformations applied (Marquez/W3C PROV/Apache Atlas or others; a lightweight lineage-record sketch follows this list)


— The usability of the approach for the capture and extraction of datasets for use by others


— Applying simple machine learning to the data
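
For the machine-learning pipeline criterion, one possible shape is to fit a scikit-learn pipeline on the curated data, persist it, and serve predictions from a small Flask endpoint. The column names, file paths, and /predict route below are illustrative assumptions, not part of the brief; the same training/serving split carries over to Spark MLlib or other frameworks.

    """Sketch: train a scikit-learn pipeline, persist it, serve it over HTTP.
    Feature/label column names and file paths are illustrative assumptions."""
    import joblib
    import pandas as pd
    from flask import Flask, jsonify, request
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    MODEL_PATH = "model.joblib"


    def train(csv_path: str = "data/curated/training.csv") -> None:
        """Fit a simple pipeline on curated data and persist it to disk."""
        df = pd.read_csv(csv_path)
        X, y = df.drop(columns=["label"]), df["label"]  # assumed column layout
        pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
        pipeline.fit(X, y)
        joblib.dump(pipeline, MODEL_PATH)


    app = Flask(__name__)


    @app.route("/predict", methods=["POST"])
    def predict():
        """Score a JSON list of records with the persisted pipeline."""
        model = joblib.load(MODEL_PATH)
        records = pd.DataFrame(request.get_json())
        return jsonify(predictions=model.predict(records).tolist())


    if __name__ == "__main__":
        train()
        app.run(port=5000)

A client would then POST a list of JSON records to http://localhost:5000/predict and receive predictions back.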
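
For the data-lineage criterion, dedicated tooling such as Marquez, OpenLineage, or Apache Atlas is the intended route. The sketch below is only a hand-rolled, PROV-inspired stand-in that records, for each run, which inputs and which transformation produced which output; all paths and job names are illustrative assumptions.

    """Sketch: a minimal lineage record in the spirit of W3C PROV.
    One JSON document per pipeline run, listing inputs, activity, and output."""
    import json
    from datetime import datetime, timezone
    from pathlib import Path


    def record_lineage(inputs: list[str], activity: str, output: str,
                       log_dir: str = "lineage") -> Path:
        """Append a run-level lineage entry describing how `output` was derived."""
        generated_at = datetime.now(timezone.utc).isoformat()
        entry = {
            "entity": output,              # dataset that was produced
            "was_generated_by": activity,  # transformation / job name
            "used": inputs,                # upstream datasets (origins)
            "generated_at": generated_at,
        }
        Path(log_dir).mkdir(exist_ok=True)
        out_file = Path(log_dir) / f"{activity}-{generated_at.replace(':', '-')}.json"
        out_file.write_text(json.dumps(entry, indent=2))
        return out_file


    if __name__ == "__main__":
        record_lineage(
            inputs=["https://www.reddit.com/r/dataengineering/new.json",
                    "data/raw/posts.json"],
            activity="clean-and-partition",
            output="data/curated/posts",
        )

Calling this at the end of each collection or processing step leaves a simple, queryable trail of where every curated dataset came from.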

