Homework 3: ----
Processing Internet Relay Chat (IRC) logs.
The purpose of this homework is to analyze textual log data from an online chat forum related to the Anonymous hacktivist group.
You will learn how to apply regular expressions, summarize log data, quantify text data, and summarize time trends.
IRC is an early protocol for instant messaging developed in the early years of the Internet.
Its openness and the ability to remain anonymous have made IRC a popular channel for hacker networks to collaborate and share ideas.
The data comes from the AZSecure Project. It contains two years of chats between hackers associated with the hacktivist group Anonymous.
In these logs they share information about malware, setting up servers to deploy attacks, and other information related to hacking systems.
The collection and analysis of these chats is a form of cyber-threat intelligence.
The analysis of these chats and other dark web data sources enables proactive defense against attacks.
1. User Data
Which users posted the most messages? (2 pts)
Which users logged in the greatest number of times? (2 pts)
Which users spent the most time in the chat? (3 pts)
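As a starting point for the first two questions, here is a minimal Python sketch of counting posts and logins with regular expressions. The line layout is an assumption, not something this handout specifies: it supposes irssi-style lines such as "HH:MM <+nick> text" for messages and "HH:MM -!- nick [ident@host] has joined #channel" for logins. The function name, the sample lines, and the nicknames alice and bob are made up for illustration; check the patterns against the real log before relying on them.

```python
import re
from collections import Counter

# ASSUMPTION: irssi-style log lines. Adjust both patterns to the real file.
# Message line:  "20:01 <+nick> some text"  (optional mode prefix before the nick)
MSG_RE = re.compile(r"^\d{2}:\d{2} <[ @+%~&]?([^>]+)> ?(.*)$")
# Login line:    "20:03 -!- nick [ident@host] has joined #channel"
JOIN_RE = re.compile(r"^\d{2}:\d{2} -!- (\S+) \[([^\]]*)\] has joined")


def count_posts_and_joins(lines, exclude=()):
    """Return (posts per nick, joins per nick), skipping nicks in `exclude`."""
    posts, joins = Counter(), Counter()
    for line in lines:
        msg = MSG_RE.match(line)
        if msg:
            nick = msg.group(1)
            if nick not in exclude:
                posts[nick] += 1
            continue
        join = JOIN_RE.match(line)
        if join and join.group(1) not in exclude:
            joins[join.group(1)] += 1
    return posts, joins


# Made-up sample lines, not taken from the real data.
sample = [
    "20:01 <+evilbot> hello",
    "20:02 < alice> hi all",
    "20:03 -!- bob [~bob@example.invalid] has joined #chan",
    "20:04 <@alice> welcome bob",
]
posts, joins = count_posts_and_joins(sample, exclude={"evilbot"})
print(posts.most_common(3))
print(joins.most_common(3))
```

Passing the bot's nickname in `exclude` keeps it out of the rankings without changing the parsing logic.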
2. Messages
Count the total number of written messages (only those with actual text content). (1 pt)
Find the most common words (only include message content) (2 pts)
Find and rank (by count) words not in an English dictionary (2 pts). This is a simple method that can identify some names of malware tools.
How many distinct URLs were posted in the chat? (1 pt)
Which URLs were posted the most (top 5)? (1 pt)
Generate a list of sites on the Dark Web (sites ending in .onion). (1 pt)
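The word and URL questions above could be approached along the following lines. This is a sketch under stated assumptions: it takes message texts that have already been separated from timestamps and usernames, the URL and .onion patterns are deliberately simple and will miss some cases, and `dictionary` stands in for whatever English word list you load yourself. All names and sample strings are hypothetical.

```python
import re
from collections import Counter

# Simple patterns chosen for illustration; they are not exhaustive.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"]+", re.IGNORECASE)
ONION_RE = re.compile(r"\b[\w.-]+\.onion\b", re.IGNORECASE)


def summarize_messages(texts, dictionary):
    """Return (word counts, non-dictionary word counts, URL counts, .onion hosts)."""
    words, urls = Counter(), Counter()
    for text in texts:
        # Strip trailing punctuation that often sticks to pasted links.
        urls.update(u.rstrip(".,)") for u in URL_RE.findall(text))
        # Remove URLs before tokenizing so their pieces are not counted as words.
        words.update(WORD_RE.findall(URL_RE.sub(" ", text).lower()))
    unknown = Counter({w: n for w, n in words.items() if w not in dictionary})
    onions = sorted({m.lower() for u in urls for m in ONION_RE.findall(u)})
    return words, unknown, urls, onions


# Made-up sample messages and a tiny stand-in dictionary.
texts = [
    "check http://example.com/a and http://example.com/a",
    "the loic tool is at http://abcdefghijklmnop.onion/x",
    "the tool works",
]
dictionary = {"check", "and", "the", "tool", "is", "at", "works"}
words, unknown, urls, onions = summarize_messages(texts, dictionary)
print(words.most_common(5))
print(unknown.most_common(5))
print(len(urls), urls.most_common(5))
print(onions)
```

The number of distinct URLs is `len(urls)`, and `most_common(5)` gives the top five. Expect the non-dictionary list to contain slang, typos, and nicknames alongside tool names, so it needs manual review.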
3. General Activity
Which hours of the day had the most messages? (1 pt)
Which days had the most messages (top 10 days)? (2 pts)
Rank the days of the week by average message count. (1 pt)
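Once each message has a full timestamp, the three activity questions reduce to grouping and counting. The sketch below assumes you already have one `datetime` per message (resolving dates from the log is a separate step), and it averages each weekday over the dates that actually appear in the data, which is one possible reading of "average message count". The function name and sample timestamps are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def activity_summary(timestamps):
    """Return (messages per hour, messages per date, weekdays ranked by mean)."""
    by_hour = Counter(ts.hour for ts in timestamps)
    by_day = Counter(ts.date() for ts in timestamps)
    # Group daily totals by weekday. Dates with no messages never appear in
    # by_day, so silent days are NOT counted as zeros in this average.
    per_weekday = defaultdict(list)
    for day, n in by_day.items():
        per_weekday[WEEKDAYS[day.weekday()]].append(n)
    averages = {name: sum(v) / len(v) for name, v in per_weekday.items()}
    ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    return by_hour, by_day, ranked


# Made-up timestamps, not taken from the real data.
stamps = [
    datetime(2011, 9, 5, 20, 1), datetime(2011, 9, 5, 20, 30),
    datetime(2011, 9, 6, 9, 0),
    datetime(2011, 9, 12, 20, 5), datetime(2011, 9, 12, 21, 0),
    datetime(2011, 9, 12, 21, 1), datetime(2011, 9, 12, 22, 0),
]
by_hour, by_day, ranked = activity_summary(stamps)
print(by_hour.most_common(1))
print(by_day.most_common(10))
print(ranked)
```

If you prefer to count silent days as zero, divide each weekday's total by the number of times that weekday occurs between the first and last date of the log instead.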
Formatting Your Answers: ----
1. User Data Example: The user guapo had the most posts, with 11972 messages.
2. Messages Example: The total number of chat messages is: 229,606. If you exclude evilbot, there are 198,883 messages.
3. General Activity Examples:
(1) The time of day with the most messages: 20:00-20:59: 13407 messages
(2) The time of day with the fewest messages: 10:00-10:59: 3696 messages.
(3) Most active day: XXXXXXXXXX: 2804 messages
(4) Least active day: XXXXXXXXXX: 7 messages
The analysis portion of the assignment is graded out of 12 points, although the questions above are worth up to 19 points in total.
Your code should also be well-documented with comments, sources, and explanations of what is happening. Fully documented code will receive full credit.
Mostly complete documentation will receive a deduction of a point, minimal documentation will result in a deduction of 2 points, and no documentation will result in a deduction of 3 points from your score.
<+evilbot> This user is a bot. If possible, filter this user’s posts from the chat.
You can identify changes in days with the messages “--- Day changed Mon Sep XXXXXXXXXX”. Some of these markers are missing from the log.
You can correct for a missing marker by looking at the times of the day (i.e., the hour rolls over to 00).
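The rollover correction can be sketched as follows. The code assumes that the text after "Day changed" parses with the format "%a %b %d %Y" and that message lines start with "HH:MM "; both are assumptions about the log, so verify them against the file. It also needs a starting date, which you would take from the top of the log. The function name and sample lines are hypothetical.

```python
import re
from datetime import date, datetime, timedelta

DAY_RE = re.compile(r"^--- Day changed (.+)$")
TIME_RE = re.compile(r"^(\d{2}):(\d{2}) ")


def attach_dates(lines, start_date):
    """Yield-style helper: return a list of (datetime, line) for timestamped lines."""
    current, prev_hour, out = start_date, None, []
    for line in lines:
        marker = DAY_RE.match(line)
        if marker:
            # ASSUMPTION: marker text looks like "Wed Sep 07 2011".
            current = datetime.strptime(marker.group(1).strip(),
                                        "%a %b %d %Y").date()
            prev_hour = None  # the marker already accounts for the rollover
            continue
        stamp = TIME_RE.match(line)
        if not stamp:
            continue
        hour, minute = int(stamp.group(1)), int(stamp.group(2))
        if prev_hour is not None and hour < prev_hour:
            # No marker seen but the clock went backwards: infer a new day.
            current += timedelta(days=1)
        prev_hour = hour
        out.append((datetime(current.year, current.month, current.day,
                             hour, minute), line))
    return out


# Made-up sample: the first midnight has no marker, the second one does.
sample = [
    "23:58 <alice> late",
    "00:02 <alice> past midnight",
    "--- Day changed Wed Sep 07 2011",
    "00:01 <bob> morning",
]
dated = attach_dates(sample, date(2011, 9, 5))
for ts, line in dated:
    print(ts.isoformat(), line)
```

This heuristic cannot detect a gap longer than a day with no marker, and it would add a spurious day if lines within one day are out of order, so spot-check the results around long quiet periods.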
Users can change their usernames. For login and logout behavior, a more stable alternative to usernames is the login identifier (for example: [ XXXXXXXXXX]).
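For the time-in-chat question, one possible approach is to pair each login with the next logout for the same login identifier and sum the durations. The sketch below assumes you have already extracted time-ordered events of the form (timestamp, identifier, kind), where kind is "join" or "leave"; that event structure, the function name, and the identifier "~alice@example.invalid" are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def time_in_chat(events):
    """Sum session lengths per identifier from time-ordered join/leave events."""
    open_since = {}                     # identifier -> time of the open login
    totals = defaultdict(timedelta)
    for ts, ident, kind in events:
        if kind == "join":
            # If a login is already open (missed logout), keep the earlier one.
            open_since.setdefault(ident, ts)
        elif kind == "leave" and ident in open_since:
            totals[ident] += ts - open_since.pop(ident)
    # Sessions still open at the end of the log are ignored in this sketch.
    return dict(totals)


# Made-up events, not taken from the real data.
events = [
    (datetime(2011, 9, 5, 20, 0), "~alice@example.invalid", "join"),
    (datetime(2011, 9, 5, 20, 30), "~alice@example.invalid", "leave"),
    (datetime(2011, 9, 5, 21, 0), "~alice@example.invalid", "join"),
    (datetime(2011, 9, 5, 21, 15), "~alice@example.invalid", "leave"),
]
totals = time_in_chat(events)
print(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

How to treat logins with no matching logout is a judgment call; document whichever rule you choose in your comments.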