# Assignment: Data Management and Regression Analysis

## Concept

An important way to test the relationship between two variables Y and X is to run the model:

Y = a + bX

using ordinary least squares (OLS) from the statsmodels.formula.api package. When we run regressions, we not only estimate the parameters a and b, which can then be used for predictions, but we also learn how well the model fits (i.e., how much of the variance in Y is explained by a + bX).
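As a minimal sketch of the model above, the following fits Y = a + bX with `statsmodels.formula.api` on synthetic data. The data, the seed, and the true values a = 2 and b = 3 are assumptions made purely for illustration; they are not taken from the assignment files.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only: Y = 2 + 3X + noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.normal(size=200)})
df["Y"] = 2.0 + 3.0 * df["X"] + rng.normal(scale=0.5, size=200)

# Fit Y = a + bX by ordinary least squares
model = smf.ols("Y ~ X", data=df).fit()

a = model.params["Intercept"]  # estimate of a
b = model.params["X"]          # estimate of b
r_squared = model.rsquared     # share of the variance in Y explained by a + bX

# Use the estimated parameters to predict Y for new values of X
new_x = pd.DataFrame({"X": [0.0, 1.0]})
predictions = model.predict(new_x)
```

Calling `model.summary()` in a notebook prints the full regression table, including standard errors and the R-squared.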

There are many critical issues, such as the selection and measurement of the X and Y variables. For example, are the variables scaled properly? How should the X variables be selected?

Common sense and business knowledge can often guide you in the proper direction, but one also has to smartly use exploratory data analysis (EDA).
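One way to approach the scaling question with EDA is to compare the skewness of a variable before and after a log transform, and to look at correlations on both scales. The sketch below uses synthetic, heavily right-skewed data; the column names `total_assets` and `audit_fees` and the data-generating process are hypothetical stand-ins, not the assignment data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data for illustration only
rng = np.random.default_rng(1)
n = 500
total_assets = rng.lognormal(mean=6.0, sigma=2.0, size=n)
audit_fees = 50.0 * total_assets ** 0.5 * rng.lognormal(sigma=0.3, size=n)
df = pd.DataFrame({"total_assets": total_assets, "audit_fees": audit_fees})

# Skewness on the raw scale
skew_raw = df[["total_assets", "audit_fees"]].skew()

# Log-transform both variables and recompute skewness
df["log_assets"] = np.log(df["total_assets"])
df["log_fees"] = np.log(df["audit_fees"])
skew_log = df[["log_assets", "log_fees"]].skew()

# Pearson correlation between X and Y on each scale, for comparison
corr_raw = df["total_assets"].corr(df["audit_fees"])
corr_log = df["log_assets"].corr(df["log_fees"])
```

Histograms or scatter plots of the same columns (for example `df.hist()` in a notebook) show the same information visually.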

In this assignment, you will apply such analyses to understand the relationship between audit fees (the Y variable) and financial characteristics of a firm (the X variables).

You are provided with data from two separate sources:

· Assignment 4 OL AuditFees201019 contains audit fee information from the Audit Analytics database.
· Assignment 4 OL Compustat201019 contains financial characteristics of firms from the Compustat Annual Industrial file.

You can find some of the variables defined in the Compustat database.

## Requirements

You are expected to conduct library research (search www.scholar.google.com using keywords such as audit fees) to gain an understanding of variables affecting audit fees.

· The main requirement is that you identify and demonstrate a model explaining audit fees (Y) using firm characteristics (X). Please use OLS.
· Use EDA as well as business judgment to identify the best set of X variables. In short, demonstrate skill in feature engineering.
· Demonstrate pandas skill and ability in data acquisition, data cleaning, data management, and analysis.
· Demonstrate advanced ability in reporting using a Jupyter notebook. Recall that an analytics report has many components (see spec. sheet for previous projects as well as the list below). You are expected to showcase increasing skill in reporting as you make progress in the course.

An analytics report has many components such as:

· An introduction that discusses the scope of the analysis
· A description of data used in the analysis along with data cleaning procedures
· Code that clearly shows how an algorithm is implemented
· Results
· Discussion of results and generation of insight when appropriate
· Summary when appropriate

## Submission

Submit a pdf as before. The total length should not exceed 10 pages.

Answered 3 days after Mar 24, 2023

## Answer To: Assignment: Data Management and Regression Analysis

Amar Kumar answered on Mar 26 2023
Introduction:
In this assignment, we will be exploring the relationship between audit fees (the Y variable) and financial characteristics of a firm (the X variables). We will be using data from two separate sources: AuditFees201019 and Compustat201019. AuditFees201019 contains information on audit fees from the Audit Analytics database, while Compustat201019 contains financial characteristics of firms from the Compustat Annual Industrial file.
Our goal is to identify and demonstrate a model that explains audit fees (Y) using firm characteristics (X). To achieve this, we will use OLS (ordinary least squares) from the statsmodels.formula.api package. OLS is a commonly used method for testing the relationship between two variables. By running the model Y = a + bX, we can estimate the parameters a and b, which can be used for predictions. We can also understand how well the model fits, or how much of the variance in Y is explained by a + bX.
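For reference, the two quantities mentioned above can be written out using the standard textbook definitions. The subscript i (indexing the n observations), the hats on the estimates, and the sample mean \bar{Y} are notation introduced here, not in the assignment text.

$$
(\hat{a}, \hat{b}) = \arg\min_{a,\,b} \sum_{i=1}^{n} \left( Y_i - a - b X_i \right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( Y_i - \hat{a} - \hat{b} X_i \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}
$$

An R^2 close to 1 means that a + bX accounts for most of the variance in Y, while a value close to 0 means it accounts for very little.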
However, selecting and measuring the X and Y variables properly is crucial. We need to consider issues such as whether the variables are scaled properly and how to select the X variables. Common sense and business knowledge can often guide us in the proper direction, but we also need to use exploratory data analysis (EDA) smartly.
Data Description:
We have two datasets for this project. The first dataset, AuditFees201019, contains information on audit fees paid by companies to their external auditors. The dataset has the following columns:
· FISCAL_YEAR: The fiscal year in which the audit was conducted.
· FISCAL_YEAR_ENDED: The date on which the company's fiscal year ended.
· AUDIT_FEES: The audit fees paid by the company to its external auditor.
· AUDITOR_NAME: The name of the external auditor.
· COMPANY_FKEY: A unique identifier for each company.
· BEST_EDGAR_TICKER: The ticker symbol for the company.
The second dataset, Compustat201019, contains financial characteristics of firms. The dataset has the following columns:
· popsrc: Population source.
· datafmt: Data format.
· tic: Ticker symbol.
· conm: Company name.
· curcd: Currency code.
· act: Current assets.
· at: Total assets.
· ceq: Common equity.
· ebit: Earnings before interest and taxes.
· ebitda: Earnings before interest, taxes, depreciation, and amortization.
· emp: Number of employees.
· invt: Inventory.
· lct: Current liabilities.
· pifo: Pretax income, foreign. (Compustat's mnemonic for net property, plant, and equipment is ppent.)
· exchg: Exchange code.
· costat: Active/Inactive status.
· fic: ISO country code of incorporation.
Data Cleaning and Preparation
1. Data Acquisition:
The first step in data cleaning and preparation is to acquire the data. In this case, you have two separate datasets - one containing audit fee information from the Audit Analytics database, and one containing financial characteristics of firms from the Compustat Annual Industrial file. You will need to read these datasets into your Jupyter notebook using pandas' read_csv() function or any other appropriate method.
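A minimal sketch of this step is shown below. Because the actual file names and formats are not given here, small in-memory CSV strings stand in for the two files, and all values in them are made up. The merge keys are also an assumption: the ticker (`BEST_EDGAR_TICKER` in one file, `tic` in the other) and the fiscal year, where `fyear` is a hypothetical Compustat column that does not appear in the column list above.

```python
import io
import pandas as pd

# Stand-ins for the two files; with real data, pass the file paths to read_csv
audit_csv = io.StringIO(
    "COMPANY_FKEY,BEST_EDGAR_TICKER,FISCAL_YEAR,AUDIT_FEES\n"
    "1001,AAA,2018,500000\n"
    "1001,AAA,2019,550000\n"
    "1002,BBB,2019,1200000\n"
)
compustat_csv = io.StringIO(
    "tic,fyear,at,emp\n"
    "AAA,2018,150.0,1.2\n"
    "AAA,2019,170.0,1.3\n"
    "CCC,2019,900.0,8.0\n"
)

audit = pd.read_csv(audit_csv)
compustat = pd.read_csv(compustat_csv)

# Assumed join keys: ticker and fiscal year (fyear is hypothetical)
keys = dict(
    left_on=["BEST_EDGAR_TICKER", "FISCAL_YEAR"],
    right_on=["tic", "fyear"],
)

# Keep only firm-years present in both sources; validate guards against
# duplicated keys silently multiplying rows
merged = audit.merge(compustat, how="inner", validate="one_to_one", **keys)

# Diagnostic: how many firm-years matched, and how many appear in one source only
match_report = audit.merge(compustat, how="outer", indicator=True, **keys)[
    "_merge"
].value_counts()
```

Checking `match_report` before dropping unmatched rows helps show how much of each dataset survives the merge, which is worth reporting in the data description.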
2. Data Cleaning:
Once you have acquired the data, the next step is to clean it. This involves identifying and correcting any errors, inconsistencies, or missing data. Some common data cleaning tasks you may need to perform include:
· Checking for missing values: You will need to check each variable in your dataset for missing values and decide how to handle them. You may choose to impute missing values with a mean or median value, drop rows with missing values, or use a more sophisticated method.
· Checking for outliers: Outliers can have a significant impact on the results of your analysis. You will need to identify and decide how to handle outliers in your dataset.
· Checking for duplicates: Duplicates can also affect the results of your analysis. You will need to check for and remove any duplicate rows in your dataset.
· Formatting data: You may need to format the data in your dataset to ensure that it is in the correct format for analysis. For example, you may need to convert strings to numeric values, or convert dates to a standardized format.
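The cleaning tasks listed above can be sketched in pandas as follows. The toy data frame, the choice of median imputation, and the 1st/99th percentile clipping thresholds are all illustrative assumptions; the right choices depend on the actual data. The `fyear` column is again hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: one exact duplicate row, fees stored as
# strings, one unparseable fee, one missing asset value, one extreme fee
raw = pd.DataFrame({
    "tic": ["AAA", "AAA", "BBB", "CCC", "DDD", "EEE"],
    "fyear": [2019, 2019, 2019, 2019, 2019, 2019],
    "AUDIT_FEES": ["500000", "500000", "1,200,000", "n/a", "750000", "90000000"],
    "at": [150.0, 150.0, 900.0, 300.0, np.nan, 400.0],
})

# Duplicates: remove exact duplicate rows
clean = raw.drop_duplicates().copy()

# Formatting: strip thousands separators and convert to numeric;
# values that cannot be parsed become NaN
clean["AUDIT_FEES"] = pd.to_numeric(
    clean["AUDIT_FEES"].str.replace(",", "", regex=False), errors="coerce"
)

# Missing values: count them per column before deciding what to do
missing_counts = clean.isna().sum()

# Drop rows with a missing Y; impute a missing X with the column median
clean = clean.dropna(subset=["AUDIT_FEES"]).copy()
clean["at"] = clean["at"].fillna(clean["at"].median())

# Outliers: clip fees to the 1st and 99th percentiles (a simple winsorization)
lower, upper = clean["AUDIT_FEES"].quantile([0.01, 0.99])
clean["AUDIT_FEES_w"] = clean["AUDIT_FEES"].clip(lower=lower, upper=upper)
```

Keeping the clipped values in a separate column (`AUDIT_FEES_w`, a name chosen here) leaves the original values available, so the regression can be run both ways and the effect of outliers reported.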
3. Data Preparation:
Once your data is clean, you can begin preparing it for analysis. This involves selecting and transforming the variables in your dataset to create the X and Y variables for your regression analysis. Some common data preparation tasks you may need to perform include:
· Feature selection: You will need to select the X variables that you want to use in your regression analysis. This should be based on your...