1
SCHOOL OF ARCHITECTURE, COMPUTING &
ENGINEERING
Submission instructions
• Cover sheet to be attached to the front of the assignment when submitted
• Question paper to be attached to assignment when submitted
• All pages to be numbered sequentially
• All work has to be presented in a ready to submit state upon arrival at
the ACE Helpdesk. Assignment cover sheets or stationery will NOT be
provided by Helpdesk staff
Module code
CN7031
Module title
Big Data Analytics
Module leader
Amin Karami
Assignment tutor
A Karami, F Jafari, MA Ghazanfar, N Qazi
Assignment title
Big Data Analytics: Coursework
Assignment number
1
Weighting
100%
Handout date
Week 5 (30th October 2020)
Submission date
Presentation: Week 12 (14th-18th December 2020)
Turnitin Submission: 25th December 2020 (midnight)
Learning outcomes
assessed by this
assignment
1-8
Turnitin submission
requirement
Yes
Turnitin GradeMark feedback
used?
No
UEL Plus Grade Book
submission used?
No
UEL Plus Grade Book
feedback used?
No
Other electronic
system used?
Yes
Are submissions / feedback
totally electronic?
Yes
Additional information
2
Form of assessment:
Individual work Group work
For group work assessment which requires members to submit both individual
and group work aspects for the assignment, the work should be submitted as:
Consolidated single document Separately by each member
Number of assignment copies required:
1 2 Other
Assignment to be presented in the following format:
On-line submission
Stapled once in the top left-hand corner
Glue bound
Spiral bound
Placed in a A4 ring bound folder (not lever arch)
Note: To students submitting work on A3/A2 boards, work has to be
contained in suitable protective case to ensure any damage to work is
avoided.
Soft copy:
CD (to be attached to the work in an envelope or purpose made wallet
adhered to the rear)
USB (to be attached to the work in an envelope or purpose made wallet
adhered to the rear)
Soft copy not required
3
CN7031 – Big Data Analytics
Group assignment 2020-21 Academic Year
This coursework (CRWK) must be attempted in the groups of 4 or 5 students. This
coursework is divided into two sections: (1) Big Data analytics on a real case study and (2)
group presentation. All the group members must attend the presentation. Presentation
would be online through Microsoft Teams. If you do not turn up in the presentation date
with the video call, you will fail the module.
Overall mark for CRWK comes from two main activities as follows:
1- Big Data Analytics report (around 3,000 words, with a tolerance of ± 10%) in HTML
format (60%)
2- Presentation (40%)
Marking Scheme
Topic Total
mark
Remarks
(breakdown of marks for each sub-task)
Big Data
Analytics using
Spark SQL
30
(6) Providing 2 queries using Spark SQL.
(14) Developing advanced SQL statements. Refer to:
https://spark.apache.org/docs/3.0.0/sql-ref.html
(10) Visualizing the outcomes of queries into the graphical and
textual format, and be able to interpret them.
Big Data
Analytics using
PySpark
60
(45) Analyzing the dataset through 3 statistical analytics
methods including advanced descriptive statistics,
correlation, hypothesis testing, density estimation, etc.
(15) Designing one classifier, then evaluate and visualize the
accuracy/performance.
Applying a multi-class classifier is considered for full mark.
Documentation 10 (10) Write down a well-organized report for a programming and
analytics project.
Total: 100
IMPORTANT: you must use CRWK template in the HTML format, otherwise it will be
counted as plagiarism and your group mark would be zero. Please refer to the “THE
FORMAT OF FINAL SUBMISSION” section.
Good Luck!
https://spark.apache.org/docs/3.0.0/sql-ref.html
4
Big Data Analytics using Spark
CN7031 – Big Data Analytics
(1) Understanding Dataset: CSE-CIC-IDS20181
This dataset was originally created by the University of New Brunswick for analyzing DDoS
data. You can find the full dataset and its description here. The dataset itself was based on
logs of the university’s servers, which found various DoS attacks throughout the publicly
available period to generate totally 80 attributes with 6.40GB size. We will use about 2.6GB
of the data to process it with the restricted PCs to 4GB RAM. Download it from here. When
writing machine learning or statistical analysis for this data, note that the Label column is
arguably the most important portion of data, as it determines if the packets sent are malicious
or not.
a) The features are described in the “IDS2018_Features.xlsx” file in Moodle page.
b) The labels are as follows:
• “Label”: normal traffic
• “Benign”: susceptible to DoS attack
c) In this coursework, we use more than 8.2-million records with the size of 2.6GB. As
a big data specialist, firstly, we should read and understand the features, then apply
modeling techniques. If you want to see a few records of this dataset, you can either
use [1] Hadoop HDFS and Hive, [2] Spark SQL or [3] RDD for printing a few records
for your understanding.
1 Source: https://registry.opendata.aws/cse-cic-ids2018/ & https://www.unb.ca/cic/datasets/ids-2018.html
https://www.unb.ca/cic/datasets/ids-2018.html
https://tinyurl.com/yyqf555f
https://registry.opendata.aws/cse-cic-ids2018/
https://www.unb.ca/cic/datasets/ids-2018.html
5
(2) Big Data Query & Analysis using Spark SQL [30 marks]
This task is using Spark SQL for converting big sized raw data into useful information. Each
member of a group should implement 2 complex SQL queries (refer to the marking
scheme). Apply appropriate visualization tools to present your findings numerically and
graphically. Interpret shortly your findings.
You can use https://spark.apache.org/docs/3.0.0/sql-ref.html for more information.
• What do you need to put in the HTML report per student?
1. At least two Spark SQL queries.
2. A short explanation of the queries.
3. The working solution, i.e., plot or table.
• Tip: The mark for this section depends on the level of your queries complexity, for
instance using the simple select query is not supposed for a full mark.
(3) Advanced Analytics using PySpark [60 marks]
In this section, you will conduct advanced analytics using PySpark.
3.1. Analyze and Interpret Big Data using PySpark (45 marks)
Every member of a group should analyze data through 3 analytical methods (e.g.,
advanced descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You
need to present your work numerically and graphically. Apply tooltip text, legend, title, X-Y
labels etc. accordingly.
Note: we need a working solution without system or logical error for the good/full mark.
3.2. Design and Build a Machine Learning (ML) technique (15 marks)
Every member of a group should go over https://spark.apache.org/docs/3.0.0/ml-guide.html
and apply one ML technique. You can apply one the following approaches: Classification,
Regression, Clustering, Dimensionality Reduction, Feature Extraction, Frequent Pattern
mining or Optimization. Explain and evaluate your model and its results into the numerical
and/or graphical representations.
Note: If you are 4 students in a group, you should develop 4 different models. If you have
a similar model, the mark would be zero.
https://spark.apache.org/docs/3.0.0/sql-ref.html
https://spark.apache.org/docs/3.0.0/ml-guide.html
6
(4) Documentation [10 marks]
Your final report must follow the “The format of final submission” section. Your work must
demonstrate appropriate understanding of building a user friendly, efficient and
comprehensive analytics report for a big data project to help move users (readers) around
to find the relevant contents.
THE FORMAT OF FINAL SUBMISSION
1- You can use either Google Colab (https://colab.research.google.com/) or Ubuntu
VMWare for this CRWK.
2- You have to convert the source code (*.ipynb) to HTML. Watch the video in the Moodle
about “how to submit the report in HTML format”.
3- Upload ONLY one single HTML file per group into Turnitin in Moodle. One member of
each group must submit the work, NOT all members. The name of the file must be in the
format of “Your-Group-ID_CN7031”, such as Group200_CN7031.html if you are
belonging to the group 200.
4- The submission link will be available from week 10, and you are free to amend your
submitted file several times before submission deadline. Your last submission will be
saved in the Moodle database for marking.
PLAGIARISM
If there are copied PySpark codes from somewhere or someone else, all the group members
will get zero, and should attend the “breach of regulation” committee for further explanations
and the probable additional penalties.
FEEDBACK TO STUDENTS
Feedback is central to learning and is provided to students to develop their knowledge,
understanding, skills and to help promote learning and facilitate improvement.
• Feedback will be provided as soon as possible after the student has completed
the assessment task.
• Feedback will be in relation to the learning outcomes and assessment criteria.
• It will be offered via Turnitin GradeMark or Moodle post.
As the feedback (including marks) is provided before Award & Field Board, marks are:
• Provisional
• available for External Examiner scrutiny
• subject to change and approval by the Assessment Board
https://colab.research.google.com/
7
ASSESSMENT FORM FOR PRESENTATION
CN7031 – Big Data Analytics (40%)
Students have to fill this section correctly. Assessors will not be liable for any mistakes.
Group No: ……………….
1
st
Student (full name and ID):
2
nd
Student (full name and ID):
3
rd
Student (full name and ID):
4
th
Student (full name and ID):
5
th
Student (full name and ID):
Assessment Criteria:
Criteria 1st 2nd 3rd 4th 5th Mark
Demonstrate/interpret Spark SQL queries 10
Understand Spark and its mechanism 5
Demonstrate/interpret PySpark codes 15
Ability to answer questions 10
Overall mark 40
Date & Time: ………………………….
Assessors’ signature and comments:
If you want to add comments on your group work, please write it here for us:
In [1]:
!sudo apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
!pip install -q findspark
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease
Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [40.7 kB]
Get:11 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,372 kB]
Get:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:14 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [237 kB]
Get:15 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [1,814 kB]
Get:16 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [15.3 kB]
Get:19 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main Sources [1,699 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2,136 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [2,243 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [266 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [53.8 kB]
Get:24 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main amd64 Packages [870 kB]
Get:25 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages [46.5 kB]
Fetched 11.1 MB in 3s (3,933 kB/s)
Reading package lists… Done
In [2]:
# Using operating system dependent functionality to read or write a file
import os
os.environ[“JAVA_HOME”]=”/usr/lib/jvm/java-8-openjdk-amd64″
os.environ[“SPARK_HOME”]=”/content/spark-3.0.1-bin-hadoop3.2″
import findspark
findspark.init()
In [3]:
# linking with SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master(“local[*]”).appName(‘Group115’).getOrCreate()
# Note: If you want to work with RDD, you should use: “from pyspark import SparkContext, SparkConf”
In [4]:
# Load Data
df1 = spark.read.load(“IDS2018/02-14-2018.csv”, format=”csv”, inferSchema=True, header=True)
df2 = spark.read.load(“IDS2018/02-15-2018.csv”, format=”csv”, inferSchema=True, header=True)
df3 = spark.read.load(“IDS2018/02-16-2018.csv”, format=”csv”, inferSchema=True, header=True)
df4 = spark.read.load(“IDS2018/02-21-2018.csv”, format=”csv”, inferSchema=True, header=True)
df5 = spark.read.load(“IDS2018/02-22-2018.csv”, format=”csv”, inferSchema=True, header=True)
df6 = spark.read.load(“IDS2018/02-23-2018.csv”, format=”csv”, inferSchema=True, header=True)
df7 = spark.read.load(“IDS2018/02-28-2018.csv”, format=”csv”, inferSchema=True, header=True)
df8 = spark.read.load(“IDS2018/03-01-2018.csv”, format=”csv”, inferSchema=True, header=True)
df9 = spark.read.load(“IDS2018/03-02-2018.csv”, format=”csv”, inferSchema=True, header=True)
In [5]:
from functools import reduce
from pyspark.sql import DataFrame
# Create a list of dataframes
dfs = [df1, df2, df3, df4, df5, df6, df7, df2, df9]
# Create a merged dataframe
IDS_df = reduce(DataFrame.unionAll, dfs)
In [ ]:
# Print DF to make sure it is working
IDS_df.show()
+——–+——–+——————-+————-+————+————+—————+—————+—————+—————+—————-+—————+—————+—————+—————-+—————+—————+————-+—————-+—————-+————+————+————+—————-+—————-+———–+———–+———–+—————-+—————-+———–+———–+————-+————-+————-+————-+————–+————–+————-+————+———–+———–+————–+————–+—————-+————+————+————+————+————+————+————–+————+————-+————–+—————-+—————-+————–+————–+—————-+————–+————–+—————-+—————-+—————-+—————-+—————-+—————–+—————–+—————–+—————-+———–+———-+———-+———-+————+————–+———–+———–+——+
|Dst Port|Protocol| Timestamp|Flow Duration|Tot Fwd Pkts|Tot Bwd Pkts|TotLen Fwd Pkts|TotLen Bwd Pkts|Fwd Pkt Len Max|Fwd Pkt Len Min|Fwd Pkt Len Mean|Fwd Pkt Len Std|Bwd Pkt Len Max|Bwd Pkt Len Min|Bwd Pkt Len Mean|Bwd Pkt Len Std| Flow Byts/s| Flow Pkts/s| Flow IAT Mean| Flow IAT Std|Flow IAT Max|Flow IAT Min| Fwd IAT Tot| Fwd IAT Mean| Fwd IAT Std|Fwd IAT Max|Fwd IAT Min|Bwd IAT Tot| Bwd IAT Mean| Bwd IAT Std|Bwd IAT Max|Bwd IAT Min|Fwd PSH Flags|Bwd PSH Flags|Fwd URG Flags|Bwd URG Flags|Fwd Header Len|Bwd Header Len| Fwd Pkts/s| Bwd Pkts/s|Pkt Len Min|Pkt Len Max| Pkt Len Mean| Pkt Len Std| Pkt Len Var|FIN Flag Cnt|SYN Flag Cnt|RST Flag Cnt|PSH Flag Cnt|ACK Flag Cnt|URG Flag Cnt|CWE Flag Count|ECE Flag Cnt|Down/Up Ratio| Pkt Size Avg|Fwd Seg Size Avg|Bwd Seg Size Avg|Fwd Byts/b Avg|Fwd Pkts/b Avg|Fwd Blk Rate Avg|Bwd Byts/b Avg|Bwd Pkts/b Avg|Bwd Blk Rate Avg|Subflow Fwd Pkts|Subflow Fwd Byts|Subflow Bwd Pkts|Subflow Bwd Byts|Init Fwd Win Byts|Init Bwd Win Byts|Fwd Act Data Pkts|Fwd Seg Size Min|Active Mean|Active Std|Active Max|Active Min| Idle Mean| Idle Std| Idle Max| Idle Min| Label|
+——–+——–+——————-+————-+————+————+—————+—————+—————+—————+—————-+—————+—————+—————+—————-+—————+—————+————-+—————-+—————-+————+————+————+—————-+—————-+———–+———–+———–+—————-+—————-+———–+———–+————-+————-+————-+————-+————–+————–+————-+————+———–+———–+————–+————–+—————-+————+————+————+————+————+————+————–+————+————-+————–+—————-+—————-+————–+————–+—————-+————–+————–+—————-+—————-+—————-+—————-+—————-+—————–+—————–+—————–+—————-+———–+———-+———-+———-+————+————–+———–+———–+——+
| 0| 0|14/02/2018 08:31:01| 112641719| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266331163| 5.63208595E7| 139.3000358938| 5.6320958E7| 5.6320761E7|1.12641719E8| 5.63208595E7| 139.3000358938|5.6320958E7|5.6320761E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266331163| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0|5.63208595E7|139.3000358938|5.6320958E7|5.6320761E7|Benign|
| 0| 0|14/02/2018 08:33:50| 112641466| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266331761| 5.6320733E7| 114.5512985522| 5.6320814E7| 5.6320652E7|1.12641466E8| 5.6320733E7| 114.5512985522|5.6320814E7|5.6320652E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266331761| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0| 5.6320733E7|114.5512985522|5.6320814E7|5.6320652E7|Benign|
| 0| 0|14/02/2018 08:36:39| 112638623| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266338483| 5.63193115E7| 301.9345955667| 5.6319525E7| 5.6319098E7|1.12638623E8| 5.63193115E7| 301.9345955667|5.6319525E7|5.6319098E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266338483| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0|5.63193115E7|301.9345955667|5.6319525E7|5.6319098E7|Benign|
| 22| 6|14/02/2018 08:40:13| 6453966| 15| 10| 1239| 2273.0| 744| 0| 82.6| 196.7412368715| 976| 0| 227.3| 371.6778922072| 544.1615279659| 3.8735871865| 268915.25|247443.778966007| 673900.0| 22.0| 6453966.0|460997.571428571|123109.423587757| 673900.0| 229740.0| 5637902.0|626433.555555556| 455082.21422401| 1167293.0| 554.0| 0| 0| 0| 0| 488| 328| 2.3241523119|1.5494348746| 0| 976|135.0769230769|277.8347599674|77192.1538461539| 0| 0| 0| 1| 0| 0| 0| 0| 0| 140.48| 82.6| 227.3| 0| 0| 0| 0| 0| 0| 15| 1239| 10| 2273| 65535| 233| 6| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 22| 6|14/02/2018 08:40:23| 8804066| 14| 11| 1143| 2209.0| 744| 0| 81.6428571429| 203.7455453568| 976| 0| 200.8181818182| 362.2498635422| 380.7331748762| 2.839597068|366836.083333333|511356.609732762| 1928102.0| 21.0| 8804066.0|677235.846153846|532416.970958985| 1928102.0| 246924.0| 7715481.0| 771548.1|755543.082716951| 2174893.0| 90.0| 0| 0| 0| 0| 456| 360| 1.5901743581|1.2494227099| 0| 976|128.9230769231|279.7630315931|78267.3538461539| 0| 0| 0| 1| 0| 0| 0| 0| 0| 134.08| 81.6428571429| 200.8181818182| 0| 0| 0| 0| 0| 0| 14| 1143| 11| 2209| 5808| 233| 6| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 22| 6|14/02/2018 08:40:31| 6989341| 16| 12| 1239| 2273.0| 744| 0| 77.4375| 190.8311535538| 976| 0| 189.4166666667| 347.6425694023| 502.4794183028| 4.0061001459|258864.481481481| 291724.14791076| 951098.0| 20.0| 6989341.0|465956.066666667|244363.896416351| 951098.0| 265831.0| 5980598.0|543690.727272727|460713.519752371| 1254338.0| 78.0| 0| 0| 0| 0| 332| 252| 2.2892000834|1.7169000625| 0| 976|121.1034482759|265.7086676402|70601.0960591133| 0| 0| 0| 1| 0| 0| 0| 0| 0|125.4285714286| 77.4375| 189.4166666667| 0| 0| 0| 0| 0| 0| 16| 1239| 12| 2273| 5808| 234| 7| 20| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 0| 0|14/02/2018 08:39:28| 112640480| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266334092| 5.632024E7| 203.6467529817| 5.6320384E7| 5.6320096E7| 1.1264048E8| 5.632024E7| 203.6467529817|5.6320384E7|5.6320096E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266334092| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0| 5.632024E7|203.6467529817|5.6320384E7|5.6320096E7|Benign|
| 0| 0|14/02/2018 08:42:17| 112641244| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266332286| 5.6320622E7| 62.2253967444| 5.6320666E7| 5.6320578E7|1.12641244E8| 5.6320622E7| 62.2253967444|5.6320666E7|5.6320578E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266332286| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0| 5.6320622E7| 62.2253967444|5.6320666E7|5.6320578E7|Benign|
| 80| 6|14/02/2018 08:47:14| 476513| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1414.4419984345|16.7886290615|68073.2857142857|115865.792656438| 237711.0| 24.0| 476513.0| 119128.25|137379.963358017| 238470.0| 108.0| 238634.0| 119317.0|167621.076693833| 237843.0| 791.0| 0| 0| 0| 0| 168| 104|10.4928931635|6.2957358981| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14600| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:15| 475048| 5| 3| 220| 472.0| 220| 0| 44.0| 98.38699101| 472| 0| 157.3333333333| 272.5093270575|1456.6949024099|16.8404034961| 67864.0|115746.933154476| 237494.0| 15.0| 475048.0| 118762.0|137096.759626185| 237853.0| 15.0| 237516.0| 118758.0|167472.584269784| 237179.0| 337.0| 0| 0| 0| 0| 168| 104| 10.525252185| 6.315151311| 0| 472| 76.8888888889|165.0669897681|27247.1111111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 86.5| 44.0| 157.3333333333| 0| 0| 0| 0| 0| 0| 5| 220| 3| 472| 14600| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:15| 474926| 5| 3| 220| 472.0| 220| 0| 44.0| 98.38699101| 472| 0| 157.3333333333| 272.5093270575|1457.0691012916|16.8447294947|67846.5714285714|115645.740842248| 237162.0| 15.0| 474926.0| 118731.5|136923.365842601| 237497.0| 15.0| 237732.0| 118866.0|167663.503100705| 237422.0| 310.0| 0| 0| 0| 0| 168| 104|10.5279559342|6.3167735605| 0| 472| 76.8888888889|165.0669897681|27247.1111111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 86.5| 44.0| 157.3333333333| 0| 0| 0| 0| 0| 0| 5| 220| 3| 472| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:16| 477471| 5| 3| 209| 461.0| 209| 0| 41.8| 93.4676414595| 461| 0| 153.6666666667| 266.1584740964|1403.2265833946|16.7549442793|68210.1428571429|116178.228792989| 238389.0| 17.0| 477471.0| 119367.75|137516.508224467| 238515.0| 149.0| 238887.0| 119443.5|168454.755588852| 238559.0| 328.0| 0| 0| 0| 0| 168| 104|10.4718401746|6.2831041048| 0| 461| 74.4444444444|160.5942955954|25790.5277777778| 0| 0| 0| 1| 0| 0| 0| 0| 0| 83.75| 41.8| 153.6666666667| 0| 0| 0| 0| 0| 0| 5| 209| 3| 461| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:16| 512758| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1314.4602327024|15.6019018718|73251.1428571429|124959.473740394| 256188.0| 10.0| 512758.0| 128189.5|148006.106082373| 256523.0| 10.0| 256563.0| 128281.5|180935.190276795| 256222.0| 341.0| 0| 0| 0| 0| 168| 104| 9.7511886699|5.8507132019| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:17| 476711| 5| 3| 206| 458.0| 206| 0| 41.2| 92.126000673| 458| 0| 152.6666666667| 264.4264232888|1392.8774456641|16.7816559719|68101.5714285714|115977.215079597| 238034.0| 16.0| 476711.0| 119177.75| 137207.36560009| 238274.0| 78.0| 238033.0| 119016.5|168007.864103143| 237816.0| 217.0| 0| 0| 0| 0| 168| 104|10.4885349824|6.2931209894| 0| 458| 73.7777777778|159.3783060659|25401.4444444444| 0| 0| 0| 1| 0| 0| 0| 0| 0| 83.0| 41.2| 152.6666666667| 0| 0| 0| 0| 0| 0| 5| 206| 3| 458| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:17| 476616| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1414.1363277775|16.7850009232| 68088.0|115553.073301117| 237285.0| 8.0| 476616.0| 119154.0|136576.972942001| 237558.0| 8.0| 237660.0| 118830.0|167563.093937776| 237315.0| 345.0| 0| 0| 0| 0| 168| 104| 10.490625577|6.2943753462| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:18| 477161| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1412.5211406632|16.7658295628|68165.8571428572|116324.530541969| 238504.0| 12.0| 477161.0| 119290.25|137729.560342905| 238719.0| 12.0| 238618.0| 119309.0|168449.805841384| 238421.0| 197.0| 0| 0| 0| 0| 168| 104|10.4786434767| 6.287186086| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:18| 474670| 5| 3| 214| 466.0| 214| 0| 42.8| 95.703709437| 466| 0| 155.3333333333| 269.0452254424|1432.5742094508|16.8538142288| 67810.0|115371.262364883| 236717.0| 15.0| 474670.0| 118667.5|136423.221362787| 236894.0| 55.0| 236992.0| 118496.0| 167297.22178805| 236793.0| 199.0| 0| 0| 0| 0| 168| 104| 10.533633893|6.3201803358| 0| 466| 75.5555555556|162.6246530443|26446.7777777778| 0| 0| 0| 1| 0| 0| 0| 0| 0| 85.0| 42.8| 155.3333333333| 0| 0| 0| 0| 0| 0| 5| 214| 3| 466| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:19| 476608| 5| 3| 209| 461.0| 209| 0| 41.8| 93.4676414595| 461| 0| 153.6666666667| 266.1584740964|1405.7674231234|16.7852826642|68086.8571428572|116165.357657993| 238154.0| 21.0| 476608.0| 119152.0|137536.399945614| 238349.0| 28.0| 238442.0| 119221.0|168305.556058022| 238231.0| 211.0| 0| 0| 0| 0| 168| 104|10.4908016651|6.2944809991| 0| 461| 74.4444444444|160.5942955954|25790.5277777778| 0| 0| 0| 1| 0| 0| 0| 0| 0| 83.75| 41.8| 153.6666666667| 0| 0| 0| 0| 0| 0| 5| 209| 3| 461| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:19| 479249| 5| 3| 215| 467.0| 215| 0| 43.0| 96.1509230325| 467| 0| 155.6666666667| 269.6225757116|1423.0598290242|16.6927839182|68464.1428571429|116522.185511928| 239061.0| 18.0| 479249.0| 119812.25|137797.592674606| 239270.0| 21.0| 239238.0| 119619.0|168903.768394906| 239052.0| 186.0| 0| 0| 0| 0| 168| 104|10.4329899489|6.2597939693| 0| 467| 75.7777777778|163.0312683029|26579.1944444444| 0| 0| 0| 1| 0| 0| 0| 0| 0| 85.25| 43.0| 155.6666666667| 0| 0| 0| 0| 0| 0| 5| 215| 3| 467| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
| 80| 6|14/02/2018 08:47:20| 475967| 5| 3| 215| 467.0| 215| 0| 43.0| 96.1509230325| 467| 0| 155.6666666667| 269.6225757116|1432.8724470394|16.8078879418|67995.2857142857|115929.081086548| 237703.0| 16.0| 475967.0| 118991.75|137195.538016305| 237904.0| 21.0| 237915.0| 118957.5|167975.337191208| 237734.0| 181.0| 0| 0| 0| 0| 168| 104|10.5049299636|6.3029579782| 0| 467| 75.7777777778|163.0312683029|26579.1944444444| 0| 0| 0| 1| 0| 0| 0| 0| 0| 85.25| 43.0| 155.6666666667| 0| 0| 0| 0| 0| 0| 5| 215| 3| 467| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
+——–+——–+——————-+————-+————+————+—————+—————+—————+—————+—————-+—————+—————+—————+—————-+—————+—————+————-+—————-+—————-+————+————+————+—————-+—————-+———–+———–+———–+—————-+—————-+———–+———–+————-+————-+————-+————-+————–+————–+————-+————+———–+———–+————–+————–+—————-+————+————+————+————+————+————+————–+————+————-+————–+—————-+—————-+————–+————–+—————-+————–+————–+—————-+—————-+—————-+—————-+—————-+—————–+—————–+—————–+—————-+———–+———-+———-+———-+————+————–+———–+———–+——+
only showing top 20 rows
In [ ]:
# The total number of attacks per label
IDS_df.select(‘Label’).groupBy(‘Label’).count().orderBy(‘count’, ascending=False).show()
+——————–+——-+
| Label| count|
+——————–+——-+
| Benign|6870186|
| DDOS attack-HOIC| 686012|
| DoS attacks-Hulk| 461912|
| Bot| 286191|
| FTP-BruteForce| 193360|
| SSH-Bruteforce| 187589|
|DoS attacks-SlowH…| 139890|
|DoS attacks-Golde…| 83016|
| Infilteration| 68871|
|DoS attacks-Slowl…| 21980|
|DDOS attack-LOIC-UDP| 1730|
| Brute Force -Web| 611|
| Brute Force -XSS| 230|
| SQL Injection| 87|
| Label| 34|
| 0| 1|
+——————–+——-+
In [120]:
IDS_df2 = IDS_df.withColumnRenamed(“Tot Fwd Pkts”,”tot_fw_pk”).withColumnRenamed(“Idle Max”,”idl_max”) \
.withColumnRenamed(“dst port”,”dst_port”).withColumnRenamed(“Idle Min”,”idl_min”) \
.withColumnRenamed(“TotLen Fwd Pkts”,”tot_l_fw_pkt”).withColumnRenamed(“Flow Duration”,”fl_dur”) \
.withColumnRenamed(“Flow Byts/s”,”fl_byt_s”).withColumnRenamed(“Fwd PSH Flags”,”fw_psh_flag”) \
.withColumnRenamed(“Active Max”,”atv_max”).withColumnRenamed(“Active Min”,”atv_min”) \
.withColumnRenamed(“Pkt Size Avg”,”pkt_size_avg”).withColumnRenamed(“Fwd Seg Size Avg”,”fw_seg_avg”) \
.withColumnRenamed(“Bwd Seg Size Avg”,”bw_seg_avg”)
In [72]:
IDS_df2.createOrReplaceTempView(“IDS”)
In [10]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [11]:
# Pramod Kumar Gouda u2002425
# Query 1 [Briefly explain]: Returns a set of objects with duplicate elements eliminated and it used for collection
sqlDF=spark.sql(“SELECT Protocol, collect_set(tot_fw_pk) as totalfwdpkts FROM IDS WHERE Protocol IS NOT NULL GROUP BY Protocol “)
sqlDF.show()
+——–+——————–+
|Protocol| totalfwdpkts|
+——–+——————–+
| 6|[2207, 356, 1982,…|
| 17|[97718, 95024, 15…|
| 0|[110, 52, 387, 13…|
+——–+——————–+
In [12]:
# Pramod Kumar Gouda u2002425
# Query 2 [Briefly explain]: Selecting the number of protocol based on there type
sqlDF=spark.sql(“SELECT Protocol, count(*) FROM IDS GROUP BY Protocol”)
sqlDF.show()
+——–+——–+
|Protocol|count(1)|
+——–+——–+
| null| 34|
| 6| 6976276|
| 17| 1919793|
| 0| 105597|
+——–+——–+
In [13]:
pandas_df=sqlDF.toPandas()
pandas_df.sort_values(by=’count(1)’,ascending=False).plot(x=’Protocol’,y=’count(1)’,kind=’bar’)
Out[13]:
In [56]:
# Virender Yadav u2002208
# Query 1 [Briefly explain]: Number of forwarded packets by there count
sqlDF=spark.sql(“SELECT tot_fw_pk, count(*) FROM IDS GROUP BY tot_fw_pk HAVING COUNT(tot_fw_pk) > 50 ” )
sqlDF.show()
+———+——–+
|tot_fw_pk|count(1)|
+———+——–+
| 148| 65|
| 31| 2440|
| 85| 154|
| 137| 82|
| 65| 664|
| 53| 928|
| 133| 84|
| 78| 268|
| 155| 69|
| 108| 203|
| 34| 1945|
| 126| 101|
| 115| 190|
| 101| 167|
| 81| 202|
| 28| 4640|
| 76| 215|
| 26| 6940|
| 27| 5227|
| 44| 1018|
+———+——–+
only showing top 20 rows
In [52]:
# Virender Yadav u2002208
# Query 2 [Briefly explain]: finding the average flow duration of packets
sqlDF=spark.sql(“SELECT avg(fl_dur) from IDS”)
sqlDF.show()
+——————–+
| avg(fl_dur)|
+——————–+
|1.0503308551024554E7|
+——————–+
In [23]:
# Nishitha Angali u2001782
# Query 1 [Briefly explain]: To find the count of maximum time a flow was idle before becomming active
# and minimum time a flow was idle before becomming active with conditions given as 0 in both cases
sqlDF=spark.sql(“SELECT count(idl_max),count(idl_min) FROM IDS where idl_max = 0 AND idl_min = 0”)
sqlDF.show()
+————–+————–+
|count(idl_max)|count(idl_min)|
+————–+————–+
| 7925325| 7925325|
+————–+————–+
In [26]:
# Nishitha Angali u2001782
# Query 2 [Briefly explain]: Counting the number of idel max group whose count is greater than 20
sqlDF=spark.sql(“SELECT idl_max,count(*) FROM IDS GROUP BY idl_max HAVING COUNT(idl_max) > 20 “)
sqlDF.show()
+———–+——–+
| idl_max|count(1)|
+———–+——–+
|5.6320958E7| 21|
| 1.001003E7| 29|
|1.0004066E7| 60|
|1.0014804E7| 23|
|1.0014585E7| 41|
|1.0014137E7| 23|
|5.6319468E7| 24|
|1.0001233E7| 24|
| 1.000114E7| 26|
|1.0010131E7| 31|
|5.6318028E7| 22|
|1.0014601E7| 38|
| 1.001432E7| 24|
|1.0003658E7| 35|
| 6.06E7| 36|
| 1.66E7| 35|
| 1.000975E7| 26|
|1.0001486E7| 29|
| 1.001427E7| 21|
|1.0013879E7| 29|
+———–+——–+
only showing top 20 rows
In [27]:
# Meetkumar Rasikbhai Patel u2001677
# Query 1 [Briefly explain]: Counting the number of destination ports by there type
sqlDF=spark.sql(“SELECT dst_port, count(*) from IDS GROUP BY dst_port “)
sqlDF.show()
+——–+——–+
|dst_port|count(1)|
+——–+——–+
| 38422| 29|
| 40386| 35|
| 35982| 39|
| 3997| 5|
| 1829| 4|
| 51415| 204|
| 26706| 4|
| 15846| 3|
| 51607| 184|
| 49308| 56|
| 50348| 216|
| 49855| 222|
| 50353| 206|
| 51393| 206|
| 51123| 210|
| 51595| 188|
| 63964| 26|
| 64519| 29|
| 57020| 69|
| 50223| 223|
+——–+——–+
only showing top 20 rows
In [53]:
# Meetkumar Rasikbhai Patel u2001677
# Query 2 [Briefly explain]: summing up the total length of forwarded packets
sqlDF=spark.sql(“SELECT sum(tot_l_fw_pkt) from IDS”)
sqlDF.show()
+—————–+
|sum(tot_l_fw_pkt)|
+—————–+
| 10422158811|
+—————–+
In [73]:
# Maulik Bhikhabhai Padhiyar u2002324
# Query 1 [Briefly explain]: selecting the different types of PSH flag
sqlDF=spark.sql(“SELECT DISTINCT fw_psh_flag from IDS “)
sqlDF.show()
+———–+
|fw_psh_flag|
+———–+
| null|
| 1|
| 0|
+———–+
In [68]:
# Maulik Bhikhabhai Padhiyar u2002324
# Query 2 [Briefly explain]: average byte rate which is transfered per second
sqlDF=spark.sql(“SELECT count(fl_byt_s) from IDS where fl_byt_s > 0”)
sqlDF.show()
+—————+
|count(fl_byt_s)|
+—————+
| 5781032|
+—————+
In [74]:
IDS_df2 = IDS_df2.na.drop()
In [121]:
# Pramod Kumar Gouda u2002425
# Analytical method 1: We are converting required columns from string to float to find skewness,is a measure of the
# asymmetry of the data around sample mean
from pyspark.sql.functions import col
selected_features = [‘tot_fw_pk’,’idl_max’,’atv_max’,’atv_min’,’idl_min’,’tot_l_fw_pkt’,’pkt_size_avg’,’fw_seg_avg’,’bw_seg_avg’]
IDS_selected_features_df = IDS_df2.select(*(col(c).cast(“float”).alias(c) for c in selected_features))
IDS_selected_features_df.show()
+———+———–+——-+——-+———–+————+————+———-+———-+
|tot_fw_pk| idl_max|atv_max|atv_min| idl_min|tot_l_fw_pkt|pkt_size_avg|fw_seg_avg|bw_seg_avg|
+———+———–+——-+——-+———–+————+————+———-+———-+
| 3.0| 5.632096E7| 0.0| 0.0| 5.632076E7| 0.0| 0.0| 0.0| 0.0|
| 3.0|5.6320816E7| 0.0| 0.0|5.6320652E7| 0.0| 0.0| 0.0| 0.0|
| 3.0|5.6319524E7| 0.0| 0.0|5.6319096E7| 0.0| 0.0| 0.0| 0.0|
| 15.0| 0.0| 0.0| 0.0| 0.0| 1239.0| 140.48| 82.6| 227.3|
| 14.0| 0.0| 0.0| 0.0| 0.0| 1143.0| 134.08| 81.64286| 200.81818|
| 16.0| 0.0| 0.0| 0.0| 0.0| 1239.0| 125.42857| 77.4375| 189.41667|
| 3.0|5.6320384E7| 0.0| 0.0|5.6320096E7| 0.0| 0.0| 0.0| 0.0|
| 3.0|5.6320664E7| 0.0| 0.0|5.6320576E7| 0.0| 0.0| 0.0| 0.0|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 220.0| 86.5| 44.0| 157.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 220.0| 86.5| 44.0| 157.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 209.0| 83.75| 41.8| 153.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 206.0| 83.0| 41.2| 152.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 214.0| 85.0| 42.8| 155.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 209.0| 83.75| 41.8| 153.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 215.0| 85.25| 43.0| 155.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 215.0| 85.25| 43.0| 155.66667|
+———+———–+——-+——-+———–+————+————+———-+———-+
only showing top 20 rows
In [81]:
from pyspark.sql import functions as f
IDS_selected_features_df.select(f.skewness(IDS_selected_features_df[‘tot_fw_pk’])).show()
+——————-+
|skewness(tot_fw_pk)|
+——————-+
| 77.74947867361696|
+——————-+
In [83]:
# Pramod Kumar Gouda u2002425
# Analytical method 2: I am finding the correlation between two columns where if one column increases its corelated
# with other column and if another column value decreases
IDS_selected_features_df.stat.corr(“atv_max”,”atv_min”)
Out[83]:
0.7414425772390605
In [84]:
# Pramod Kumar Gouda u2002425
# Analytical method 3: kernel density estimate on a pyspark dataframe column and use it for
# creating a new column with the estimates
from pyspark.mllib.stat import KernelDensity
dat_rdd = IDS_selected_features_df.select(“tot_fw_pk”).rdd
dat_rdd_data = dat_rdd.map(lambda x: x[0])
kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0,14.0])
Out[84]:
array([0.0097561 , 0.01106945])
In [86]:
# Virender Yadav u2002208
# Analytical method 1: finding the kurtosis and is about tails of the distribution,measures of outliners
IDS_selected_features_df.select(f.kurtosis(IDS_selected_features_df[‘atv_min’])).show()
+——————+
| kurtosis(atv_min)|
+——————+
|4022.4866270803145|
+——————+
In [91]:
# Virender Yadav u2002208
# Analytical method 2: find the percentile of a 80th %
IDS_df2.groupby(‘label’).agg(f.expr(‘percentile(atv_max, array(0.80))’)[0].alias(‘%80’)).show()
+——————–+———+
| label| %80|
+——————–+———+
| SSH-Bruteforce| 0.0|
| Label| null|
| Infilteration| 0.0|
| 0| 0.0|
| SQL Injection| 0.0|
|DoS attacks-Slowl…|7678920.2|
| Benign| 0.0|
|DoS attacks-SlowH…| 0.0|
| Bot| 0.0|
|DoS attacks-Golde…| 441.0|
| Brute Force -XSS| 0.0|
| FTP-BruteForce| 0.0|
|DDOS attack-LOIC-UDP| 0.0|
| DoS attacks-Hulk| 0.0|
| Brute Force -Web|3999879.0|
| DDOS attack-HOIC| 0.0|
+——————–+———+
In [95]:
# Virender Yadav u2002208
# Analytical method 3: calculating standard deviation its value is how far from the normal ,squareroot of variance
IDS_df2.agg(f.stddev(“atv_max”)).show()
+——————–+
|stddev_samp(atv_max)|
+——————–+
| 1749057.5452774959|
+——————–+
In [85]:
# Nishitha Angali u2001782
# Analytical method 1: To find skewness,is a measure of the asymmetry of the data around sample mean
IDS_selected_features_df.select(f.skewness(IDS_selected_features_df[‘idl_max’])).show()
+—————–+
|skewness(idl_max)|
+—————–+
|992.0339103023476|
+—————–+
In [97]:
# Nishitha Angali u2001782
# Analytical method 2: finding the correlation between two columns where if one column increases its corelated
# with other column and if another column value decreases
IDS_selected_features_df.stat.corr(“idl_max”,”idl_min”)
Out[97]:
0.33075364446687444
In [101]:
# Nishitha Angali u2001782
# Analytical method 3: find the percentile of a 75th %
IDS_df2.groupby(‘protocol’).agg(f.expr(‘percentile(idl_max, array(0.75))’)[0].alias(‘%75’)).show()
+——–+——-+
|protocol| %75|
+——–+——-+
| null| null|
| 6| 0.0|
| 17| 0.0|
| 0|5.632E7|
+——–+——-+
In [116]:
# Meetkumar Rasikbhai Patel u2001677
# Analytical method 1: finding the kurtosis and is about tails of the distribution,measures of outliners
IDS_selected_features_df.select(f.kurtosis(IDS_selected_features_df[‘tot_l_fw_pkt’])).show()
+———————-+
|kurtosis(tot_l_fw_pkt)|
+———————-+
| 1529504.304325319|
+———————-+
In [113]:
# Meetkumar Rasikbhai Patel u2001677
# Analytical method 2: calucating the average value for the column
IDS_df2.select(f.mean(“tot_l_fw_pkt”)).show()
+——————+
| avg(tot_l_fw_pkt)|
+——————+
|1157.8033234070226|
+——————+
In [122]:
# Meetkumar Rasikbhai Patel u2001677
# Analytical method 3: finding the correlation between two columns where if one column increases its corelated
# with other column and if another column value decreases
IDS_selected_features_df.stat.corr(“fw_seg_avg”,”bw_seg_avg”)
Out[122]:
0.37286275767327126
In [119]:
# Maulik Bhikhabhai Padhiyar u2002324
# Analytical method 1: To find skewness,is a measure of the asymmetry of the data around sample mean
IDS_selected_features_df.select(f.skewness(IDS_selected_features_df[‘pkt_size_avg’])).show()
+———————-+
|skewness(pkt_size_avg)|
+———————-+
| 3.4566237116943674|
+———————-+
In [123]:
# Maulik Bhikhabhai Padhiyar u2002324
# Analytical method 2:
IDS_df2.agg(f.stddev(“idl_min”)).show()
+——————–+
|stddev_samp(idl_min)|
+——————–+
| 8.435726565284234E7|
+——————–+
In [124]:
# Maulik Bhikhabhai Padhiyar u2002324
# Analytical method 3: Obtaining the maximum value in column
IDS_df2.agg(f.max(“idl_max”)).show()
+————+
|max(idl_max)|
+————+
| 9.79781E11|
+————+
In [ ]:
# pramod Kumar Gouda u2002425
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.feature import Word2Vec
wv_rdd = IDS_df2.rdd
inp = wv_rdd.map(lambda row: row.split(” “))
word2vec = Word2Vec()
model = word2vec.fit(inp)
synonyms = model.findSynonyms(‘Benign’, 5)
for word, cosine_distance in synonyms:
print(“{}: {}”.format(word, cosine_distance))
In [ ]:
# Virender Yadav u2002208
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
# Load and parse the data
wv_rdd = IDS_df2.rdd
parsedData = wv_rdd.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(‘ ‘)]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)
# Output topics. Each is a distribution over words (matching word count vectors)
print(“Learned topics (as distributions over vocab of ” + str(ldaModel.vocabSize())
+ ” words):”)
topics = ldaModel.topicsMatrix()
for topic in range(3):
print(“Topic ” + str(topic) + “:”)
for word in range(0, ldaModel.vocabSize()):
print(” ” + str(topics[word][topic]))
In [ ]:
# Nishitha Angali u2001782
# Machine Learning Technique:
# What to achieve:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Load and parse the data
wv_rdd = IDS_df2.rdd
parsedData = wv_rdd.map(lambda line: array([float(x) for x in line.split(‘,’)]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode=”random”)
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point – center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print(“Within Set Sum of Squared Error = ” + str(WSSSE))
In [ ]:
# Meetkumar Rasikbhai Patel u2001677
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors
data = sc.textFile(“data/mllib/kmeans_data.txt”)
parsedData = data.map(lambda x: [float(t) for t in x.split(” “)])
# Create weight vector.
transformingVector = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(transformingVector)
# Batch transform
transformedData = transformer.transform(parsedData)
# Single-row transform
transformedData2 = transformer.transform(parsedData.first())
In [ ]:
# Maulik Bhikhabhai Padhiyar u2002324
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.feature import Normalizer
from pyspark.mllib.util import MLUtils
data = MLUtils.loadLibSVMFile(sc, “data/mllib/sample_libsvm_data.txt”)
labels = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
normalizer1 = Normalizer()
normalizer2 = Normalizer(p=float(“inf”))
# Each sample in data1 will be normalized using $L^2$ norm.
data1 = labels.zip(normalizer1.transform(features))
# Each sample in data2 will be normalized using $L^\infty$ norm.
data2 = labels.zip(normalizer2.transform(features))
In [125]:
# install nbconvert
!pip install nbconvert
Requirement already satisfied: nbconvert in /usr/local/lib/python3.6/dist-packages (5.6.1)
Requirement already satisfied: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert) (3.2.1)
Requirement already satisfied: jinja2>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (2.11.2)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.6.0)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (4.3.3)
Requirement already satisfied: nbformat>=4.4 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (5.0.8)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.8.4)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from nbconvert) (2.6.1)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.3)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/dist-packages (from nbconvert) (4.7.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (1.4.3)
Requirement already satisfied: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.4.4)
Requirement already satisfied: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert) (0.5.1)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert) (1.15.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert) (20.7)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.4->nbconvert) (1.1.1)
Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->nbconvert) (4.4.2)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->nbconvert) (0.2.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.4->nbconvert) (2.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->bleach->nbconvert) (2.4.7)
In [ ]:
# convert ipynb to html
# file name: Group115_CN7031_CN7031.ipynb
!jupyter nbconvert –to html Group115_CN7031.ipynb
We provide professional writing services to help you score straight A’s by submitting custom written assignments that mirror your guidelines.
Get result-oriented writing and never worry about grades anymore. We follow the highest quality standards to make sure that you get perfect assignments.
Our writers have experience in dealing with papers of every educational level. You can surely rely on the expertise of our qualified professionals.
Your deadline is our threshold for success and we take it very seriously. We make sure you receive your papers before your predefined time.
Someone from our customer support team is always here to respond to your questions. So, hit us up if you have got any ambiguity or concern.
Sit back and relax while we help you out with writing your papers. We have an ultimate policy for keeping your personal and order-related details a secret.
We assure you that your document will be thoroughly checked for plagiarism and grammatical errors as we use highly authentic and licit sources.
Still reluctant about placing an order? Our 100% Moneyback Guarantee backs you up on rare occasions where you aren’t satisfied with the writing.
You don’t have to wait for an update for hours; you can track the progress of your order any time you want. We share the status after each step.
Although you can leverage our expertise for any writing task, we have a knack for creating flawless papers for the following document types.
Although you can leverage our expertise for any writing task, we have a knack for creating flawless papers for the following document types.
From brainstorming your paper's outline to perfecting its grammar, we perform every step carefully to make your paper worthy of A grade.
Hire your preferred writer anytime. Simply specify if you want your preferred expert to write your paper and we’ll make that happen.
Get an elaborate and authentic grammar check report with your work to have the grammar goodness sealed in your document.
You can purchase this feature if you want our writers to sum up your paper in the form of a concise and well-articulated summary.
You don’t have to worry about plagiarism anymore. Get a plagiarism report to certify the uniqueness of your work.
Join us for the best experience while seeking writing assistance in your college life. A good grade is all you need to boost up your academic excellence and we are all about it.
We create perfect papers according to the guidelines.
We seamlessly edit out errors from your papers.
We thoroughly read your final draft to identify errors.
Work with ultimate peace of mind because we ensure that your academic work is our responsibility and your grades are a top concern for us!
Dedication. Quality. Commitment. Punctuality
Here is what we have achieved so far. These numbers are evidence that we go the extra mile to make your college journey successful.
We have the most intuitive and minimalistic process so that you can easily place an order. Just follow a few steps to unlock success.
We understand your guidelines first before delivering any writing service. You can discuss your writing needs and we will have them evaluated by our dedicated team.
We write your papers in a standardized way. We complete your work in such a way that it turns out to be a perfect description of your guidelines.
We promise you excellent grades and academic excellence that you always longed for. Our writers stay in touch with you via email.