This is for a machine learning subject. See the file called 'Task.pdf' for everything that I want done. The code has to be in Python. The file called 'Example.ipynb' might be helpful.
I have attached everything.
Deliverables:
1. One Jupyter notebook (Python code, with captions to make it easy to understand).
2. One simple document (prediction + a simple explanation of malicious URLs). Deadline: 21/05/2021, before 12AM on the 22nd (Australian local time).
3. One research paper (with IEEE references). It has to be done before 26 May 2021.


COMP8325: Applications of Machine Learning in Cyber Security
Group Project Description
Macquarie University, Session 1, 2021

1 Group Project Deadlines
Group Project (20%): exploring machine learning methods for cyber security applications. We envisioned the following timeline to track progress of the group projects:
• Week 07 (Friday 23 April 11:55PM): group formation due
• Week 12 (Monday 24 May 11:55PM): presentation video recording due
• Week 13 (Sunday 06 June 11:55PM): final report + implementation source code due

2 Learning Outcomes
By completing this project, you should demonstrate your ability to:
• Understand and detect abnormal patterns in a variety of real-world datasets for cyber security applications and systems
• Perform data pre-processing, data exploration and feature engineering on various types and volumes of data for different cyber security applications
• Perform machine learning model training and evaluation for cyber security applications
• Understand the security threats to machine learning systems deployed for cyber security
• Analyse the trends of applications of machine learning in the cyber security domain
• Communicate professionally in written and oral form to a range of audiences

3 Submission
This is a group (of at most 3 students) project. Only one submission is required per group. Your team needs to submit the interim report in iLearn via a Turnitin submission portal before the submission deadline.

Presentation. Your team is required to give a presentation on your project, discussing the data analysis process with machine learning, interpreting your results and findings, and your experience in applying machine learning models in cyber security applications. The group presentation will be submitted as a video recording, so you need to hold an online group meeting with your teammates, give the presentation, and record it for submission. Each group will present for 10 minutes, and each student is expected to present for around 3 minutes. (To keep the video a reasonable size, we recommend a Zoom meeting and recording, as was done for the lecture recordings.)

Final Report. Your team needs to submit a final report discussing the process of conducting the project. Your team needs to submit the final report in iLearn via a Turnitin submission portal before the submission deadline.

Source Code. Given this is a group project, all project-related materials should be managed on a team collaboration platform. Specifically, you are required to use GitHub to manage this project. It is good that some students have already adopted GitHub for their Assignment I.
• The source code (in Python) should be executable with ease. No GUI is required, but a clear README or other documentation should be given. This means that we can repeat your data analysis process and results with little effort.
• If the data analysis tasks are reported in the form of a Jupyter notebook, the amount of reporting should be roughly equivalent to what has been requested above. The related literature review and the reflection should still be submitted as required above.
• We will use GitHub Classroom to manage your repos. Your team needs to use the following link to create your project repo: https://classroom.github.com/g/RvVd6tvJ. When you follow the link you will be asked to select your team. If you are the first in your group, you should enter the group name, which must be the same as in iLearn, e.g., Group A, Group B.
If you are not the first, you should be able to select your group from the list. This will then allow you to collaborate with your teammates on the repo in GitHub: https://classroom.github.com/g/RvVd6tvJ

Your presentation slides need to be submitted as well, with file name "Project-presentation_Group<#group>-". You should include them in the GitHub repo.

4 Marking Criteria
• All the required data analysis tasks and research tasks have been reasonably accomplished.
• The organisation, presentation and readability of the reports.
• Appropriate justification of what you have chosen and what you have done in the data analysis process, as well as critical thinking about and understanding of the related aspects of the machine learning methods.
• The related literature review is well conducted with some depth, and the self-reflection on your project experience is clearly described.
• The presentation is well performed by the team, showing professional communication capability and team coordination.
• The source code repository is well maintained, with frequent and informative commit messages to show your work progress, a clear and informative README, and high quality source code.
Note that you are expected to use Python for your project.

5 Late Submissions
Late submissions will incur the following penalties:
• 10% penalty for 1 to 24 hours late,
• 20% penalty for 24 to 48 hours late, and
• 100% penalty for over 48 hours late (the iLearn submission portal automatically closes).
If you have a legitimate reason for submitting late, discuss this with the convenor well in advance of the submission due date.

6 Peer Evaluation
Group projects enable students to develop skills and attributes that are very different to those developed through working individually. Some examples are team building, communication, negotiation and respect for others' perspectives. Good teamwork involves managing a process so that everybody communicates and effectively contributes to the task and final product. These team attributes are important in professional practice, where people need to work with others to share expertise and to achieve a result. Peer evaluation is an excellent way to evaluate team work, since the members of the group know how well the group functioned and the relative contribution of each member. In this unit, we will use the SparkPlus software to evaluate the contributions of each student to a group work task.

7 Project Details
Total Marks: 20
Project Overview: In this project, you are required to work in a group of at most three students to
1. conduct analysis on real-world data sets (see Table 1) in various cyber security applications, and
2. review the recent research on topics related to machine learning and security.
In the first task, we leverage publicly available cybersecurity datasets related to a particular application space, such as "Web Attack Detection" or "Network Intrusion Detection", to evaluate machine learning models we have developed, and can train machine learning models for specific real-world application deployment. In the second task, you are required to further analyse the research trend in machine learning and cyber security. Your first and second tasks should be aligned with each other. In other words, if you choose to analyse a dataset (as your first task) in "Web Attack Detection", then you are required to review (as your second task) research papers related to "Web Attack Detection".
Data Analysis Task (Marks: 10)
You are expected to conduct anomaly detection on at least TWO (at least THREE for a group of 3 students) of the data sets listed in Table 1. It is worth noting that, unlike the assignments, we have not performed any pre-processing on these data sets. You can find basic information about the raw data sets on the webpages offering them. For some data sets, you can also find information about how they have been used in other machine learning models. While you can learn from others how to pre-process the data sets, note that we only intend to make use of the data sets themselves, not the other information on the webpages.

Table 1: Data Sets for Project
Cyber Security Application | Data Sets
Web Attack Detection | Web Attack Payloads (https://github.com/foospidy/payloads); http://blog.ptsecurity.com/2019/02/detecting-web-attacks-with-seq2seq.html
Network Intrusion Detection | UNSW-NB15 network packets datasets (https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/)
Malicious URL Detection | Malicious URLs Data Sets (http://www.sysnet.ucsd.edu/projects/url/)
Malware Detection | Malware Training Data Sets (https://github.com/marcoramilli/MalwareTrainingSets)
Probing/Port Scan Detection | Probing Dataset (https://github.com/gubertoli/ProbingDataset)

You are required to select and download at least TWO (at least THREE for a group of 3 students) data sets listed in Table 1, based on your preference. Then, you need to perform data pre-processing and feature engineering on the selected data sets to prepare the data for the anomaly detection models. Depending on the specific data set, this stage may involve steps such as missing value handling, duplication removal, categorical/continuous attribute processing, TF-IDF feature extraction, time series feature extraction, subsampling/oversampling, dimension reduction, and class label handling. The whole point of these steps is to produce an appropriate data set for the training and evaluation of detection models. Note that some steps may be repeated during model training and tuning in order to obtain better detection performance. All the pre-processing and feature engineering details should be described in detail and justified reasonably in the reports.

You need to use at least two anomaly detection methods (discussed in our lectures), compare the detection performance (e.g., AUC and execution time) of these two anomaly detection models, and show which is better for a given data set. Feature selection and hyperparameter tuning should also be considered for performance improvement. The process of training/testing the models, feature selection, hyperparameter tuning, and model comparison should be reported, with visualization of intermediate or final results. Deep result analysis and critical thinking linking to ML theoretical aspects are also expected.

Research Task (Marks: 10)
You are required to conduct a literature review on research topics related to machine learning and cyber security which have attracted plenty of attention from academia and industry.
Note that the application area of your research task should be aligned with that of your data analysis task. Through the literature review, you can analyse the trends of applications of machine learning in the cyber security domain. Find at least five (at least 8 for a group of 3 students) research-based publications.
Answered 18 days after May 15, 2021 | COMP8325 | Macquarie University

Answer

Uttam answered on May 21 2021
Example_new/EXAMPLE_FINAL.html
Anomaly detection can be treated as a statistical task, namely outlier analysis. But if we develop a machine learning model, it can be automated and, as usual, can save a lot of time. There are many use cases of anomaly detection: credit card fraud detection, detection of faulty machines or hardware systems based on their anomalous features, and disease detection based on medical records are some good examples. There are many more, and the use of anomaly detection will only grow.
In [2]:

# Total Marks: 2
# Task1: Anomaly Detection Method 1 for Phishing / Malicious URL Detection
# Importing the libraries.
import glob

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.sparse as sparse
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import IsolationForest

# Load the malicious URL data set (svmlight format). Note that this loop
# keeps only the last matched file: the full data set is too large to
# hold in memory (see the explanation below).
files = glob.glob("url_svmlight/url_svmlight/*")
for file_name in files:
    dataset = load_svmlight_file(file_name)

X_train, y_train = dataset

# Visualize the sparsity pattern of the feature matrix with spy().
plt.spy(X_train)

# For comparison, the sparsity pattern of a small random sparse matrix.
A = sparse.random(100, 90, density=0.01)
plt.spy(A)


Out[2]:



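If more memory were available, the per-day svmlight files could be stacked into a single sparse matrix instead of keeping only the last one. A minimal sketch follows; the Day*.svm file names and the 3,231,961-feature dimensionality are assumptions about this data set's layout.
In [ ]:

import glob
import numpy as np
import scipy.sparse as sparse
from sklearn.datasets import load_svmlight_file

# Load the first few day files and stack them into one sparse matrix.
# Fixing n_features aligns every file to the same columns (assumed
# dimensionality of the URL data set).
day_files = sorted(glob.glob("url_svmlight/url_svmlight/Day*.svm"))[:3]
parts = [load_svmlight_file(f, n_features=3231961) for f in day_files]
X = sparse.vstack([p[0] for p in parts])
y = np.concatenate([p[1] for p in parts])
print(X.shape, y.shape)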
In [4]:

clf = IsolationForest(max_samples=100, random_state=42)
clf.fit(X_train)


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_iforest.py:285: UserWarning: max_samples (100) is greater than the total number of samples (64). max_samples will be set to n_samples for estimation.
warn("max_samples (%s) is greater than the "

Out[4]:
IsolationForest(max_samples=100, random_state=42)
In [9]:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
# plot the line, the samples, and the nearest vectors to the plane
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white',
                 s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green',
                 s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red',
                s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()



In [12]:

# Making predictions on the training, test and outlier data
# (the synthetic data from the cell above; y_train from the URL
# data set is reshaped here but not used further).
y_train = y_train.reshape(-1, 1)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

EXPLANATION
Yes, we have generated a random sample out of the provided dataset, because the dataset is huge and my computer cannot handle that much data. That is why I have taken a random subset of the complete data to train my model.
Feature engineering is a well-known concept in the machine learning field. For my purposes, I treat it as a process for creating metrics from data sources whose structure does not lend itself easily to large scale analysis. Since many of the performance metric sources I use are multi-column tables, I logically unpivot the columns from the MCTs of many metric sources into key-value-pair rows, then union all the KVPs together for analysis. This not only greatly simplifies the analysis but also allows me to massively expand it to tens of thousands of metrics.
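A minimal sketch of the unpivoting idea described above, using pandas.melt; the table and column names are hypothetical.
In [ ]:

import pandas as pd

# Hypothetical multi-column table (MCT): one row per host, one column
# per performance metric.
mct = pd.DataFrame({
    "host": ["web01", "web02"],
    "cpu_pct": [73.0, 41.5],
    "mem_pct": [62.1, 88.4],
})

# Unpivot the metric columns into key-value-pair rows, so every metric
# can be analysed (and unioned with other sources) in the same shape.
kvp = mct.melt(id_vars="host", var_name="metric", value_name="value")
print(kvp)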
In [17]:

# Total Marks: 2
# Task1: Analysis of anomaly detection by visualizing another dataset
# with an AUC curve, for Phishing / Malicious URL Detection.
from sklearn.datasets import load_breast_cancer
# Importing the dataset.
X, y = load_breast_cancer(return_X_y=True)
from sklearn.ensemble import IsolationForest
# Building the model.
clf = IsolationForest(max_samples=100,
                      random_state=0, contamination='auto')
clf.fit(X)
y_pred = clf.score_samples(X)
## Plotting the ROC curve.
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y, y_pred)
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, 'k-', lw=2)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
'''
A high AUC indicates that the model performs well at distinguishing between the positive and negative classes.
'''




Out[17]:
'\nA high AUC indicates that the model performs well at distinguishing between the positive and negative classes.\n'
EXPLANATION
Because the dataset you provided is too big, and it is structured similarly to the breast cancer dataset, I used the breast cancer dataset here to show how the AUC curve is obtained.
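To complement the curve, the AUC can also be computed as a single number. A minimal sketch with the same model; it relies on score_samples returning higher values for more normal points, which matches the benign majority class being labelled 1 in this dataset.
In [ ]:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
clf = IsolationForest(max_samples=100, random_state=0).fit(X)

# Higher score_samples() values mean "more normal", which aligns with
# label 1 (benign) here, so the scores can be used directly.
print("AUC:", roc_auc_score(y, clf.score_samples(X)))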
In [18]:

# Total Marks: 2
# Task1: Anomaly Detection Method 2 For Phishing or Malicious URL Detection
import numpy as np
import pandas as pd
## Generating Data Randomly
random_seed = np.random.RandomState(12)
X_train = 0.5 * random_seed.randn(500, 2)
X_train = np.r_[X_train + 3, X_train]
X_train = pd.DataFrame(X_train, columns=["x", "y"])
X_test = 0.5 * random_seed.randn(500, 2)
X_test = np.r_[X_test + 3, X_test]
X_test = pd.DataFrame(X_test, columns=["x", "y"])
X_outliers = random_seed.uniform(low=-5, high=5, size=(50, 2))
X_outliers = pd.DataFrame(X_outliers, columns=["x", "y"])
%matplotlib inline
import matplotlib.pyplot as plt
## Plot the data using a scatter plot.
p1 = plt.scatter(X_train.x, X_train.y, c="white", s=50, edgecolor="black")
p2 = plt.scatter(X_test.x, X_test.y, c="green", s=50, edgecolor="black")
p3 = plt.scatter(X_outliers.x, X_outliers.y, c="blue", s=50, edgecolor="black")
plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend(
    [p1, p2, p3],
    ["training set", "normal testing set", "anomalous testing set"],
    loc="lower right",
)
plt.show()
from sklearn.ensemble import IsolationForest
## Building the model.
clf = IsolationForest()
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
X_outliers = X_outliers.assign(pred=y_pred_outliers)
print(X_outliers.head())
## Plot the predictions on the outliers using a scatter plot.
p1 = plt.scatter(X_train.x, X_train.y, c="white", s=50, edgecolor="black")
p2 = plt.scatter(
    X_outliers.loc[X_outliers.pred == -1, ["x"]],
    X_outliers.loc[X_outliers.pred == -1, ["y"]],
    c="blue",
    s=50,
    edgecolor="black",
)
p3 = plt.scatter(
    X_outliers.loc[X_outliers.pred == 1, ["x"]],
    X_outliers.loc[X_outliers.pred == 1, ["y"]],
    c="red",
    s=50,
    edgecolor="black",
)
plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend(
    [p1, p2, p3],
    ["training observations", "detected outliers", "incorrectly labeled outliers"],
    loc="lower right",
)
plt.show()
## This is the Isolation Forest model.
# Anomaly detection with Isolation Forest is a process composed of two main stages:
# In the first stage, a training dataset is used to build iTrees.
# In the second stage, each instance in the test set is passed through the iTrees built in the first stage, and an "anomaly score" is assigned to it.
# Once all the instances in the test set have been assigned an anomaly score, any point whose score exceeds a predefined, domain-dependent threshold can be marked as an "anomaly" (a minimal sketch of this thresholding step is given after the output below).





x y pred
0 3.947504 2.891003 -1
1 0.413976 -2.025841 -1
2 -2.644476 -3.480783 -1
3 -0.518212 -3.386443 -1
4 2.977669 2.215355 1
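A minimal sketch of the thresholding step mentioned in the comments above, reusing clf and X_test from this cell; the 5% quantile threshold is illustrative, not a recommended value.
In [ ]:

import numpy as np

# Turn per-instance anomaly scores into labels by comparing against a
# domain-chosen threshold; score_samples() is higher for normal points.
scores = clf.score_samples(X_test)
threshold = np.quantile(scores, 0.05)  # illustrative: flag the lowest 5%
is_anomaly = scores < threshold
print("flagged", int(is_anomaly.sum()), "of", len(scores), "instances")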


In [ ]:

# Total Marks: 2
# Task1: Analysis of Anomaly Detection Method 2 For Phishing or Malicious URL Detection
## Doing more visualizations for a better understanding of the data.
# Attach the model's predictions to the normal test set first
# (1 = predicted normal, -1 = predicted anomaly).
X_test = X_test.assign(pred=y_pred_test)
p1 = plt.scatter(X_train.x, X_train.y, c="white", s=50, edgecolor="black")
p2 = plt.scatter(
    X_test.loc[X_test.pred == 1, ["x"]],
    X_test.loc[X_test.pred == 1, ["y"]],
    c="blue",
    s=50,
    edgecolor="black",
)
p3 = plt.scatter(
    X_test.loc[X_test.pred == -1, ["x"]],
    X_test.loc[X_test.pred == -1, ["y"]],
    c="red",
    s=50,
    edgecolor="black",
)
plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend(
    [p1, p2, p3],
    [
        "training observations",
        "correctly labeled test observations",
        "incorrectly labeled test observations",
    ],
    loc="lower right",
)
plt.show()

In [ ]:

# Total Marks: Literature Review: 5
# Presentation: 5 Marks
# Task2: Research Task
'''
Malicious URLs, a.k.a. malicious websites, are a common and serious threat to cybersecurity. Malicious URLs host unsolicited
content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users into becoming victims of scams
(monetary loss, theft of private information, and malware installation), causing losses of billions of dollars every year.
It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through
the use of blacklists. However, blacklists cannot be exhaustive and lack the ability to detect newly generated malicious
URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing
attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of malicious
URL detection techniques using machine learning. We present the formal formulation of malicious URL detection as a machine
learning task, and categorize and review the contributions of literature studies that address different dimensions of this
problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for
a range of different audiences, not only machine learning researchers and engineers in academia, but also professionals
and practitioners in the cybersecurity industry, to help them understand the state of the art and facilitate their own research
and practical applications. We also discuss practical issues in system design and open research challenges, and point out some
important directions for future research.
Phishing is a form of cybercrime where spammed emails and fraudulent websites entice victims to provide sensitive information
to the phishers. The acquired sensitive information is subsequently used to steal identities or gain access to money.
This work explores the possibility of combining confidence-weighted classification with content-based phishing URL
detection to produce a dynamic and extensible system for detecting present and emerging types of phishing domains.
The system is capable of detecting emerging threats as they appear and can subsequently provide increased protection against
zero-hour threats, unlike traditional blacklisting techniques, which function reactively.
Phishing is an online criminal act that occurs when a malicious webpage impersonates a legitimate webpage so as to acquire
sensitive information from the user. Phishing attacks continue to pose a serious risk for web users and an annoying threat
within the field of electronic commerce. This paper focuses on discerning the significant features that discriminate between
legitimate and phishing URLs. These features are then subjected to associative rule mining (Apriori and predictive Apriori).
The rules obtained are interpreted to emphasize the features that are more prevalent in phishing URLs. Analyzing the
knowledge available on phishing URLs and considering confidence as an indicator, features such as transport layer security,
unavailability of the top-level domain in the URL, and keywords within the path portion of the URL were found to be sensible
indicators of a phishing URL. In addition, the number of slashes in the URL, the number of dots in the host portion of the URL,
and the length of the URL are also key factors for identifying a phishing URL.
According to RSA's online fraud report, the year 2013 was confirmed to be a record year in which many phishing
attacks were launched globally. Additionally, RSA estimates that over USD $5.9 billion was lost by global organizations
due to phishing attacks in the same period. The Internet Security Threat Report 2014 reports that cybercrimes are
prevailing and damaging threats from cybercriminals still loom over businesses and customers. According to the RSA monthly
fraud report of January 2014, big data analytics and broader intelligence will lead to faster detection, resulting in
lower financial losses. Data mining techniques are used to extract helpful information by analyzing past information and
then predicting future incidents.
Phishing is a major danger to web users. The fast growth and progress of phishing techniques create an enormous challenge in
web security. Zhang et al. proposed CANTINA, a novel HTML content method for identifying phishing websites.
It inspects the source code of a webpage and makes use of TF-IDF to find the top-ranking keywords. The keywords obtained are
given as input to the Google search engine, and the method checks whether the domain name of the URL matches one of the top N
search results; if so, the page is considered legitimate. This approach relies fully on the Google search engine. CANTINA+,
proposed by Xiang et al., is an upgraded version of CANTINA, in which new features are included to achieve better results.
In particular, the authors combine the HTML Document Object Model, third-party services, and Google search engines with a
machine learning technique to identify phishing web pages.
'''
'''References
[1] A. van der Heijden and L. Allodi, "Cognitive triaging of phishing attacks," in Proc. 28th USENIX Security Symposium, 2019. https://www.usenix.org/system/files/sec19-van_der_heijden.pdf
[2] A. Oest, P. Zhang, B. Wardman, E. Nunes, J. Burgis, A. Zand, K. Thomas, A. Doupé, and G.-J. Ahn, "Sunrise to sunset: Analyzing the end-to-end life cycle and effectiveness of phishing attacks at scale," in Proc. 29th USENIX Security Symposium, 2020, pp. 361-377.
'''
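A minimal sketch of extracting the lexical URL features highlighted in the review above (URL length, number of slashes, dots in the host portion, and transport layer security); the example URL is made up.
In [ ]:

from urllib.parse import urlparse

def lexical_url_features(url):
    # Lexical indicators discussed in the literature review.
    parsed = urlparse(url)
    return {
        "length": len(url),
        "num_slashes": url.count("/"),
        "num_dots_in_host": parsed.netloc.count("."),
        "uses_tls": parsed.scheme == "https",
    }

print(lexical_url_features("http://paypal.com.example.biz/login/verify"))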
