Apache Spark: Joining Datasets and Performing Aggregations on Grouped Data


Joining Datasets and Performing Aggregations on Grouped Data with Apache Spark. The exercise has five parts: joining the data, renaming and removing columns, joining to the destination airport, a top-ten airport analysis, and user-defined functions.

1. Flight Data

In the first part of this exercise, you will load flight data from the domestic-flights/flights.parquet file and airport codes from the airport-codes/airport-codes.csv file. As a reminder, you can load parquet and CSV files as shown in the provided code below.

As a first step, load both files into Spark and print the schemas. The flight data uses the International Air Transport Association (IATA) codes of the origin and destination airports. The IATA code is a three-letter code identifying the airport. For instance, Omaha's Eppley Airfield is OMA, Baltimore-Washington International Airport is BWI, Los Angeles International Airport is LAX, and New York's John F. Kennedy International Airport is JFK. The airport codes file contains information for each of the airports.

a. Join the Data

Join the flight data to the airport codes data by matching the IATA code of the originating flight to the IATA code in the airport codes file. Note that the airport codes file may not contain IATA codes for all of the origin and destination airports in the flight data. We still want information on those flights even if we cannot match them to a value in the airport codes file. This means you will want to use a left join instead of the default inner join. Print the schema of the joined dataframe. (A sketch of this join follows the provided code below.)

b. Rename and Remove Columns

Next, we want to rename some of the joined columns and remove unneeded columns. (A sketch follows the provided code below.)

Remove the following columns from the joined dataframe:

__index_level_0__
ident
local_code
continent
iso_country
iata_code

Rename the following columns:

type: origin_airport_type
name: origin_airport_name
elevation_ft: origin_airport_elevation_ft
iso_region: origin_airport_region
municipality: origin_airport_municipality
gps_code: origin_airport_gps_code
coordinates: origin_airport_coordinates

c. Join to Destination Airport

Repeat parts a and b, joining the airport codes file to the destination airport instead of the origin airport. Drop the same columns and rename the same columns using the prefix destination_airport_ instead of origin_airport_. Print the schema of the resultant dataframe. The final schema and dataframe should contain the added information (name, region, coordinates, ...) for both the destination and origin airports.

d. Top Ten Airports

Create a dataframe using only data from 2008. This dataframe will be a report of the top ten airports by the number of inbound passengers. It should contain the following fields:

Rank (1-10)
Name
IATA code
Total Inbound Passengers
Total Inbound Flights
Average Daily Passengers
Average Inbound Flights

Show the results of this dataframe using the show method. (A sketch follows the provided code below.)

e. User Defined Functions

The latitude and longitude coordinates for the destination and origin airports are string values, not numeric. You will create a user-defined function in Python that converts the string coordinates into numeric coordinates. The provided code below will help you create and use this user-defined function. Add new columns for destination_airport_longitude, destination_airport_latitude, origin_airport_longitude, and origin_airport_latitude.

Provide the results via your notebook or export your code for each exercise.
# Provided starter code: load the two input files, then define UDFs that
# split the comma-separated coordinate string and convert each part to a
# double (the provided code treats the first field as latitude).
from pyspark.sql.functions import udf

df_flights = spark.read.parquet(flights_parquet_path)
df_airport_codes = spark.read.load(
    airport_codes_path, format="csv", sep=",", inferSchema=True, header=True
)

@udf('double')
def get_latitude(coordinates):
    # The left join can leave coordinates null for unmatched airports.
    if coordinates is None:
        return None
    split_coords = coordinates.split(',')
    if len(split_coords) != 2:
        return None
    return float(split_coords[0].strip())

@udf('double')
def get_longitude(coordinates):
    if coordinates is None:
        return None
    split_coords = coordinates.split(',')
    if len(split_coords) != 2:
        return None
    return float(split_coords[1].strip())

# Example: add a numeric longitude column for the destination airport.
df = df.withColumn(
    'destination_airport_longitude',
    get_longitude(df['destination_airport_coordinates'])
)
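A minimal sketch of part a, assuming the flight data's origin IATA column is named origin (an assumption; check df_flights.printSchema() for the actual name). The iata_code column name is confirmed by the drop list in part b.

# Part a (sketch): left join the airport codes onto the origin airport.
# "origin" is an assumed column name -- verify it against the schema.
df_joined = df_flights.join(
    df_airport_codes,
    df_flights['origin'] == df_airport_codes['iata_code'],
    how='left'  # keep every flight, even without a matching airport code
)
df_joined.printSchema()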
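Part b can then be a single drop followed by a loop of renames; the column lists come straight from the question.

# Part b (sketch): drop the unneeded columns, then rename the airport
# columns with the origin_airport_ prefix.
columns_to_drop = [
    '__index_level_0__', 'ident', 'local_code',
    'continent', 'iso_country', 'iata_code',
]
renames = {
    'type': 'origin_airport_type',
    'name': 'origin_airport_name',
    'elevation_ft': 'origin_airport_elevation_ft',
    'iso_region': 'origin_airport_region',
    'municipality': 'origin_airport_municipality',
    'gps_code': 'origin_airport_gps_code',
    'coordinates': 'origin_airport_coordinates',
}
df_joined = df_joined.drop(*columns_to_drop)
for old_name, new_name in renames.items():
    df_joined = df_joined.withColumnRenamed(old_name, new_name)

Part c repeats the same join, drop, and rename sequence against the destination IATA column, swapping the origin_airport_ prefix for destination_airport_.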
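For part d, one possible shape for the report. The column names flight_year, passengers, flights, and destination are assumptions to verify against the actual schema, and df is the fully joined dataframe from part c. The averages here divide the 2008 totals by 366 days (2008 is a leap year), which is one reasonable reading of "average daily".

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Part d (sketch): restrict to 2008, aggregate by destination airport,
# keep the ten largest by total inbound passengers.
df_2008 = df.filter(F.col('flight_year') == 2008)

df_top_ten = (
    df_2008
    .groupBy('destination', 'destination_airport_name')
    .agg(
        F.sum('passengers').alias('total_inbound_passengers'),
        F.sum('flights').alias('total_inbound_flights'),
        (F.sum('passengers') / 366).alias('average_daily_passengers'),
        (F.sum('flights') / 366).alias('average_inbound_flights'),
    )
    .orderBy(F.col('total_inbound_passengers').desc())
    .limit(10)
)

# Number the rows 1-10 to produce the Rank field.
rank_window = Window.orderBy(F.col('total_inbound_passengers').desc())
df_top_ten = df_top_ten.withColumn('rank', F.row_number().over(rank_window))

df_top_ten.select(
    'rank', 'destination_airport_name', 'destination',
    'total_inbound_passengers', 'total_inbound_flights',
    'average_daily_passengers', 'average_inbound_flights',
).show(truncate=False)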
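The provided code adds only destination_airport_longitude; the other three columns the question asks for follow the same pattern with the provided UDFs.

# Part e (sketch): the remaining three numeric coordinate columns.
df = df.withColumn(
    'destination_airport_latitude',
    get_latitude(df['destination_airport_coordinates'])
).withColumn(
    'origin_airport_longitude',
    get_longitude(df['origin_airport_coordinates'])
).withColumn(
    'origin_airport_latitude',
    get_latitude(df['origin_airport_coordinates'])
)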