Need two Big Data Query & Analysis using Spark SQL, three queries Advanced Analytics using PySpark

Don't use plagiarized sources. Get Your Custom Essay on

Just from $13/Page

Order Essay

SCHOOL OF ARCHITECTURE, COMPUTING &
ENGINEERING

Submission instructions

• Cover sheet to be attached to the front of the assignment when submitted

• Question paper to be attached to assignment when submitted

• All pages to be numbered sequentially

• All work has to be presented in a ready to submit state upon arrival at
the ACE Helpdesk. Assignment cover sheets or stationery will NOT be
provided by Helpdesk staff

Module code

CN7031

Module title

Big Data Analytics

Module leader

Amin Karami

Assignment tutor
A Karami, F Jafari, MA Ghazanfar, N Qazi

Assignment title

Big Data Analytics: Coursework

Assignment number

Weighting

100%

Handout date

Week 5 (30th October 2020)

Submission date

Presentation: Week 12 (14th-18th December 2020)

Turnitin Submission: 25th December 2020 (midnight)

Learning outcomes
assessed by this
assignment

1-8

Turnitin submission
requirement

Yes

Turnitin GradeMark feedback
used?

UEL Plus Grade Book
submission used?

No
UEL Plus Grade Book
feedback used?

Other electronic
system used?

Yes
Are submissions / feedback
totally electronic?

Yes

Additional information

Form of assessment:

Individual work Group work

For group work assessment which requires members to submit both individual
and group work aspects for the assignment, the work should be submitted as:

Consolidated single document Separately by each member

Number of assignment copies required:

1 2 Other

Assignment to be presented in the following format:

On-line submission
Stapled once in the top left-hand corner
Glue bound
Spiral bound
Placed in a A4 ring bound folder (not lever arch)

Note: To students submitting work on A3/A2 boards, work has to be

contained in suitable protective case to ensure any damage to work is
avoided.

Soft copy:

CD (to be attached to the work in an envelope or purpose made wallet
adhered to the rear)
USB (to be attached to the work in an envelope or purpose made wallet

adhered to the rear)
Soft copy not required

CN7031 – Big Data Analytics

Group assignment 2020-21 Academic Year

This coursework (CRWK) must be attempted in the groups of 4 or 5 students. This

coursework is divided into two sections: (1) Big Data analytics on a real case study and (2)

group presentation. All the group members must attend the presentation. Presentation

would be online through Microsoft Teams. If you do not turn up in the presentation date

with the video call, you will fail the module.

Overall mark for CRWK comes from two main activities as follows:

1- Big Data Analytics report (around 3,000 words, with a tolerance of ± 10%) in HTML

format (60%)

2- Presentation (40%)

Marking Scheme

Topic Total

mark

Remarks

(breakdown of marks for each sub-task)

Big Data

Analytics using

Spark SQL

(6) Providing 2 queries using Spark SQL.

(14) Developing advanced SQL statements. Refer to:

https://spark.apache.org/docs/3.0.0/sql-ref.html

(10) Visualizing the outcomes of queries into the graphical and

textual format, and be able to interpret them.

Big Data
Analytics using

PySpark

(45) Analyzing the dataset through 3 statistical analytics

methods including advanced descriptive statistics,

correlation, hypothesis testing, density estimation, etc.

(15) Designing one classifier, then evaluate and visualize the

accuracy/performance.

Applying a multi-class classifier is considered for full mark.

Documentation 10 (10) Write down a well-organized report for a programming and

analytics project.

Total: 100

IMPORTANT: you must use CRWK template in the HTML format, otherwise it will be

counted as plagiarism and your group mark would be zero. Please refer to the “THE

FORMAT OF FINAL SUBMISSION” section.

Good Luck!

https://spark.apache.org/docs/3.0.0/sql-ref.html

Big Data Analytics using Spark
CN7031 – Big Data Analytics

(1) Understanding Dataset: CSE-CIC-IDS20181

This dataset was originally created by the University of New Brunswick for analyzing DDoS

data. You can find the full dataset and its description here. The dataset itself was based on

logs of the university’s servers, which found various DoS attacks throughout the publicly

available period to generate totally 80 attributes with 6.40GB size. We will use about 2.6GB

of the data to process it with the restricted PCs to 4GB RAM. Download it from here. When

writing machine learning or statistical analysis for this data, note that the Label column is

arguably the most important portion of data, as it determines if the packets sent are malicious

or not.

a) The features are described in the “IDS2018_Features.xlsx” file in Moodle page.

b) The labels are as follows:

• “Label”: normal traffic

• “Benign”: susceptible to DoS attack

c) In this coursework, we use more than 8.2-million records with the size of 2.6GB. As

a big data specialist, firstly, we should read and understand the features, then apply

modeling techniques. If you want to see a few records of this dataset, you can either

use [1] Hadoop HDFS and Hive, [2] Spark SQL or [3] RDD for printing a few records

for your understanding.

1 Source: https://registry.opendata.aws/cse-cic-ids2018/ & https://www.unb.ca/cic/datasets/ids-2018.html

https://www.unb.ca/cic/datasets/ids-2018.html

https://tinyurl.com/yyqf555f

https://registry.opendata.aws/cse-cic-ids2018/

https://www.unb.ca/cic/datasets/ids-2018.html

(2) Big Data Query & Analysis using Spark SQL [30 marks]

This task is using Spark SQL for converting big sized raw data into useful information. Each

member of a group should implement 2 complex SQL queries (refer to the marking

scheme). Apply appropriate visualization tools to present your findings numerically and

graphically. Interpret shortly your findings.

You can use https://spark.apache.org/docs/3.0.0/sql-ref.html for more information.

• What do you need to put in the HTML report per student?

1. At least two Spark SQL queries.

2. A short explanation of the queries.

3. The working solution, i.e., plot or table.

• Tip: The mark for this section depends on the level of your queries complexity, for

instance using the simple select query is not supposed for a full mark.

(3) Advanced Analytics using PySpark [60 marks]

In this section, you will conduct advanced analytics using PySpark.

3.1. Analyze and Interpret Big Data using PySpark (45 marks)

Every member of a group should analyze data through 3 analytical methods (e.g.,

advanced descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You

need to present your work numerically and graphically. Apply tooltip text, legend, title, X-Y

labels etc. accordingly.

Note: we need a working solution without system or logical error for the good/full mark.

3.2. Design and Build a Machine Learning (ML) technique (15 marks)

Every member of a group should go over https://spark.apache.org/docs/3.0.0/ml-guide.html

and apply one ML technique. You can apply one the following approaches: Classification,

Regression, Clustering, Dimensionality Reduction, Feature Extraction, Frequent Pattern

mining or Optimization. Explain and evaluate your model and its results into the numerical

and/or graphical representations.

Note: If you are 4 students in a group, you should develop 4 different models. If you have

a similar model, the mark would be zero.

https://spark.apache.org/docs/3.0.0/sql-ref.html

https://spark.apache.org/docs/3.0.0/ml-guide.html

(4) Documentation [10 marks]

Your final report must follow the “The format of final submission” section. Your work must

demonstrate appropriate understanding of building a user friendly, efficient and

comprehensive analytics report for a big data project to help move users (readers) around

to find the relevant contents.

THE FORMAT OF FINAL SUBMISSION
1- You can use either Google Colab (https://colab.research.google.com/) or Ubuntu

VMWare for this CRWK.

2- You have to convert the source code (*.ipynb) to HTML. Watch the video in the Moodle

about “how to submit the report in HTML format”.

3- Upload ONLY one single HTML file per group into Turnitin in Moodle. One member of

each group must submit the work, NOT all members. The name of the file must be in the

format of “Your-Group-ID_CN7031”, such as Group200_CN7031.html if you are

belonging to the group 200.

4- The submission link will be available from week 10, and you are free to amend your

submitted file several times before submission deadline. Your last submission will be

saved in the Moodle database for marking.

PLAGIARISM

If there are copied PySpark codes from somewhere or someone else, all the group members

will get zero, and should attend the “breach of regulation” committee for further explanations

and the probable additional penalties.

FEEDBACK TO STUDENTS

Feedback is central to learning and is provided to students to develop their knowledge,

understanding, skills and to help promote learning and facilitate improvement.

• Feedback will be provided as soon as possible after the student has completed

the assessment task.

• Feedback will be in relation to the learning outcomes and assessment criteria.

• It will be offered via Turnitin GradeMark or Moodle post.

As the feedback (including marks) is provided before Award & Field Board, marks are:

• Provisional

• available for External Examiner scrutiny

• subject to change and approval by the Assessment Board

https://colab.research.google.com/

ASSESSMENT FORM FOR PRESENTATION
CN7031 – Big Data Analytics (40%)

Students have to fill this section correctly. Assessors will not be liable for any mistakes.

Group No: ……………….

1
st

Student (full name and ID):

2
nd

Student (full name and ID):

3
rd

Student (full name and ID):

4
th
Student (full name and ID):

5
th
Student (full name and ID):

Assessment Criteria:

Criteria 1st 2nd 3rd 4th 5th Mark

Demonstrate/interpret Spark SQL queries 10

Understand Spark and its mechanism 5

Demonstrate/interpret PySpark codes 15

Ability to answer questions 10

Overall mark 40

Date & Time: ………………………….

Assessors’ signature and comments:

Big Data Analytics [CN7031] CRWK 2020-21¶

Group ID: [115]¶

Student 1: Pramod Kumar Gouda u2002425
Student 2: Virender Yadav u2002208
Student 3: Nishitha Angali u2001782
Student 4: Meetkumar Rasikbhai Patel u2001677
Student 5: Maulik Bhikhabhai Padhiyar u2002324

If you want to add comments on your group work, please write it here for us:

Initiate and Configure Spark¶

In [1]:

!sudo apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
!pip install -q findspark

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease
Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [40.7 kB]
Get:11 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,372 kB]
Get:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:14 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [237 kB]
Get:15 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [1,814 kB]
Get:16 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [15.3 kB]
Get:19 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main Sources [1,699 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2,136 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [2,243 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [266 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [53.8 kB]
Get:24 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main amd64 Packages [870 kB]
Get:25 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages [46.5 kB]
Fetched 11.1 MB in 3s (3,933 kB/s)
Reading package lists… Done

In [2]:

# Using operating system dependent functionality to read or write a file
import os
os.environ[“JAVA_HOME”]=”/usr/lib/jvm/java-8-openjdk-amd64″
os.environ[“SPARK_HOME”]=”/content/spark-3.0.1-bin-hadoop3.2″
import findspark
findspark.init()

In [3]:

# linking with SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master(“local[*]”).appName(‘Group115’).getOrCreate()
# Note: If you want to work with RDD, you should use: “from pyspark import SparkContext, SparkConf”

Load Data¶

In [4]:

# Load Data
df1 = spark.read.load(“IDS2018/02-14-2018.csv”, format=”csv”, inferSchema=True, header=True)
df2 = spark.read.load(“IDS2018/02-15-2018.csv”, format=”csv”, inferSchema=True, header=True)
df3 = spark.read.load(“IDS2018/02-16-2018.csv”, format=”csv”, inferSchema=True, header=True)
df4 = spark.read.load(“IDS2018/02-21-2018.csv”, format=”csv”, inferSchema=True, header=True)
df5 = spark.read.load(“IDS2018/02-22-2018.csv”, format=”csv”, inferSchema=True, header=True)
df6 = spark.read.load(“IDS2018/02-23-2018.csv”, format=”csv”, inferSchema=True, header=True)
df7 = spark.read.load(“IDS2018/02-28-2018.csv”, format=”csv”, inferSchema=True, header=True)
df8 = spark.read.load(“IDS2018/03-01-2018.csv”, format=”csv”, inferSchema=True, header=True)
df9 = spark.read.load(“IDS2018/03-02-2018.csv”, format=”csv”, inferSchema=True, header=True)

In [5]:

from functools import reduce
from pyspark.sql import DataFrame
# Create a list of dataframes
dfs = [df1, df2, df3, df4, df5, df6, df7, df2, df9]
# Create a merged dataframe
IDS_df = reduce(DataFrame.unionAll, dfs)

In [ ]:

# Print DF to make sure it is working
IDS_df.show()

+——–+——&# |Dst Port|Protocol| Timestamp|Flow +——–+——&# | 0| 0|14/02/2018 08:31:01| | 0| 0|14/02/2018 08:33:50| | 0| 0|14/02/2018 08:36:39| | 22| 6|14/02/2018 08:40:13| | 22| 6|14/02/2018 08:40:23| | 22| 6|14/02/2018 08:40:31| | 0| 0|14/02/2018 08:39:28| | 0| 0|14/02/2018 08:42:17| | 80| 6|14/02/2018 08:47:14| | 80| 6|14/02/2018 08:47:15| | 80| 6|14/02/2018 08:47:15| | 80| 6|14/02/2018 08:47:16| | 80| 6|14/02/2018 08:47:16| | 80| 6|14/02/2018 08:47:17| | 80| 6|14/02/2018 08:47:17| | 80| 6|14/02/2018 08:47:18| | 80| 6|14/02/2018 08:47:18| | 80| 6|14/02/2018 08:47:19| | 80| 6|14/02/2018 08:47:19| | 80| 6|14/02/2018 08:47:20| +——–+——&# only showing top 20 rows 8211;+——————-+————-+————+————+—————+—————+—————+—————+—————-+—————+—————+—————+—————-+—————+—————+————-+—————-+—————-+————+————+————+—————-+—————-+———–+———–+———–+—————-+—————-+———–+———–+————-+————-+————-+————-+————–+————–+————-+————+———–+———–+————–+————–+—————-+————+————+————+————+————+————+————–+————+————-+————–+—————-+—————-+————–+————–+—————-+————–+————–+—————-+—————-+—————-+—————-+—————-+—————–+—————–+—————–+—————-+———–+———-+———-+———-+————+————–+———–+———–+——+
Duration|Tot Fwd Pkts|Tot Bwd Pkts|TotLen Fwd Pkts|TotLen Bwd Pkts|Fwd Pkt Len Max|Fwd Pkt Len Min|Fwd Pkt Len Mean|Fwd Pkt Len Std|Bwd Pkt Len Max|Bwd Pkt Len Min|Bwd Pkt Len Mean|Bwd Pkt Len Std| Flow Byts/s| Flow Pkts/s| Flow IAT Mean| Flow IAT Std|Flow IAT Max|Flow IAT Min| Fwd IAT Tot| Fwd IAT Mean| Fwd IAT Std|Fwd IAT Max|Fwd IAT Min|Bwd IAT Tot| Bwd IAT Mean| Bwd IAT Std|Bwd IAT Max|Bwd IAT Min|Fwd PSH Flags|Bwd PSH Flags|Fwd URG Flags|Bwd URG Flags|Fwd Header Len|Bwd Header Len| Fwd Pkts/s| Bwd Pkts/s|Pkt Len Min|Pkt Len Max| Pkt Len Mean| Pkt Len Std| Pkt Len Var|FIN Flag Cnt|SYN Flag Cnt|RST Flag Cnt|PSH Flag Cnt|ACK Flag Cnt|URG Flag Cnt|CWE Flag Count|ECE Flag Cnt|Down/Up Ratio| Pkt Size Avg|Fwd Seg Size Avg|Bwd Seg Size Avg|Fwd Byts/b Avg|Fwd Pkts/b Avg|Fwd Blk Rate Avg|Bwd Byts/b Avg|Bwd Pkts/b Avg|Bwd Blk Rate Avg|Subflow Fwd Pkts|Subflow Fwd Byts|Subflow Bwd Pkts|Subflow Bwd Byts|Init Fwd Win Byts|Init Bwd Win Byts|Fwd Act Data Pkts|Fwd Seg Size Min|Active Mean|Active Std|Active Max|Active Min| Idle Mean| Idle Std| Idle Max| Idle Min| Label|
8211;+——————-+————-+————+————+—————+—————+—————+—————+—————-+—————+—————+—————+—————-+—————+—————+————-+—————-+—————-+————+————+————+—————-+—————-+———–+———–+———–+—————-+—————-+———–+———–+————-+————-+————-+————-+————–+————–+————-+————+———–+———–+————–+————–+—————-+————+————+————+————+————+————+————–+————+————-+————–+—————-+—————-+————–+————–+—————-+————–+————–+—————-+—————-+—————-+—————-+—————-+—————–+—————–+—————–+—————-+———–+———-+———-+———-+————+————–+———–+———–+——+
112641719| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266331163| 5.63208595E7| 139.3000358938| 5.6320958E7| 5.6320761E7|1.12641719E8| 5.63208595E7| 139.3000358938|5.6320958E7|5.6320761E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266331163| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0|5.63208595E7|139.3000358938|5.6320958E7|5.6320761E7|Benign|
112641466| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266331761| 5.6320733E7| 114.5512985522| 5.6320814E7| 5.6320652E7|1.12641466E8| 5.6320733E7| 114.5512985522|5.6320814E7|5.6320652E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266331761| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0| 5.6320733E7|114.5512985522|5.6320814E7|5.6320652E7|Benign|
112638623| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266338483| 5.63193115E7| 301.9345955667| 5.6319525E7| 5.6319098E7|1.12638623E8| 5.63193115E7| 301.9345955667|5.6319525E7|5.6319098E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266338483| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0|5.63193115E7|301.9345955667|5.6319525E7|5.6319098E7|Benign|
6453966| 15| 10| 1239| 2273.0| 744| 0| 82.6| 196.7412368715| 976| 0| 227.3| 371.6778922072| 544.1615279659| 3.8735871865| 268915.25|247443.778966007| 673900.0| 22.0| 6453966.0|460997.571428571|123109.423587757| 673900.0| 229740.0| 5637902.0|626433.555555556| 455082.21422401| 1167293.0| 554.0| 0| 0| 0| 0| 488| 328| 2.3241523119|1.5494348746| 0| 976|135.0769230769|277.8347599674|77192.1538461539| 0| 0| 0| 1| 0| 0| 0| 0| 0| 140.48| 82.6| 227.3| 0| 0| 0| 0| 0| 0| 15| 1239| 10| 2273| 65535| 233| 6| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
8804066| 14| 11| 1143| 2209.0| 744| 0| 81.6428571429| 203.7455453568| 976| 0| 200.8181818182| 362.2498635422| 380.7331748762| 2.839597068|366836.083333333|511356.609732762| 1928102.0| 21.0| 8804066.0|677235.846153846|532416.970958985| 1928102.0| 246924.0| 7715481.0| 771548.1|755543.082716951| 2174893.0| 90.0| 0| 0| 0| 0| 456| 360| 1.5901743581|1.2494227099| 0| 976|128.9230769231|279.7630315931|78267.3538461539| 0| 0| 0| 1| 0| 0| 0| 0| 0| 134.08| 81.6428571429| 200.8181818182| 0| 0| 0| 0| 0| 0| 14| 1143| 11| 2209| 5808| 233| 6| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
6989341| 16| 12| 1239| 2273.0| 744| 0| 77.4375| 190.8311535538| 976| 0| 189.4166666667| 347.6425694023| 502.4794183028| 4.0061001459|258864.481481481| 291724.14791076| 951098.0| 20.0| 6989341.0|465956.066666667|244363.896416351| 951098.0| 265831.0| 5980598.0|543690.727272727|460713.519752371| 1254338.0| 78.0| 0| 0| 0| 0| 332| 252| 2.2892000834|1.7169000625| 0| 976|121.1034482759|265.7086676402|70601.0960591133| 0| 0| 0| 1| 0| 0| 0| 0| 0|125.4285714286| 77.4375| 189.4166666667| 0| 0| 0| 0| 0| 0| 16| 1239| 12| 2273| 5808| 234| 7| 20| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
112640480| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266334092| 5.632024E7| 203.6467529817| 5.6320384E7| 5.6320096E7| 1.1264048E8| 5.632024E7| 203.6467529817|5.6320384E7|5.6320096E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266334092| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0| 5.632024E7|203.6467529817|5.6320384E7|5.6320096E7|Benign|
112641244| 3| 0| 0| 0.0| 0| 0| 0.0| 0.0| 0| 0| 0.0| 0.0| 0.0| 0.0266332286| 5.6320622E7| 62.2253967444| 5.6320666E7| 5.6320578E7|1.12641244E8| 5.6320622E7| 62.2253967444|5.6320666E7|5.6320578E7| 0.0| 0.0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0.0266332286| 0.0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0.0| 0.0| 0.0| 0| 0| 0| 0| 0| 0| 3| 0| 0| 0| -1| -1| 0| 0| 0.0| 0.0| 0.0| 0.0| 5.6320622E7| 62.2253967444|5.6320666E7|5.6320578E7|Benign|
476513| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1414.4419984345|16.7886290615|68073.2857142857|115865.792656438| 237711.0| 24.0| 476513.0| 119128.25|137379.963358017| 238470.0| 108.0| 238634.0| 119317.0|167621.076693833| 237843.0| 791.0| 0| 0| 0| 0| 168| 104|10.4928931635|6.2957358981| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14600| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
475048| 5| 3| 220| 472.0| 220| 0| 44.0| 98.38699101| 472| 0| 157.3333333333| 272.5093270575|1456.6949024099|16.8404034961| 67864.0|115746.933154476| 237494.0| 15.0| 475048.0| 118762.0|137096.759626185| 237853.0| 15.0| 237516.0| 118758.0|167472.584269784| 237179.0| 337.0| 0| 0| 0| 0| 168| 104| 10.525252185| 6.315151311| 0| 472| 76.8888888889|165.0669897681|27247.1111111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 86.5| 44.0| 157.3333333333| 0| 0| 0| 0| 0| 0| 5| 220| 3| 472| 14600| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
474926| 5| 3| 220| 472.0| 220| 0| 44.0| 98.38699101| 472| 0| 157.3333333333| 272.5093270575|1457.0691012916|16.8447294947|67846.5714285714|115645.740842248| 237162.0| 15.0| 474926.0| 118731.5|136923.365842601| 237497.0| 15.0| 237732.0| 118866.0|167663.503100705| 237422.0| 310.0| 0| 0| 0| 0| 168| 104|10.5279559342|6.3167735605| 0| 472| 76.8888888889|165.0669897681|27247.1111111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 86.5| 44.0| 157.3333333333| 0| 0| 0| 0| 0| 0| 5| 220| 3| 472| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
477471| 5| 3| 209| 461.0| 209| 0| 41.8| 93.4676414595| 461| 0| 153.6666666667| 266.1584740964|1403.2265833946|16.7549442793|68210.1428571429|116178.228792989| 238389.0| 17.0| 477471.0| 119367.75|137516.508224467| 238515.0| 149.0| 238887.0| 119443.5|168454.755588852| 238559.0| 328.0| 0| 0| 0| 0| 168| 104|10.4718401746|6.2831041048| 0| 461| 74.4444444444|160.5942955954|25790.5277777778| 0| 0| 0| 1| 0| 0| 0| 0| 0| 83.75| 41.8| 153.6666666667| 0| 0| 0| 0| 0| 0| 5| 209| 3| 461| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
512758| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1314.4602327024|15.6019018718|73251.1428571429|124959.473740394| 256188.0| 10.0| 512758.0| 128189.5|148006.106082373| 256523.0| 10.0| 256563.0| 128281.5|180935.190276795| 256222.0| 341.0| 0| 0| 0| 0| 168| 104| 9.7511886699|5.8507132019| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
476711| 5| 3| 206| 458.0| 206| 0| 41.2| 92.126000673| 458| 0| 152.6666666667| 264.4264232888|1392.8774456641|16.7816559719|68101.5714285714|115977.215079597| 238034.0| 16.0| 476711.0| 119177.75| 137207.36560009| 238274.0| 78.0| 238033.0| 119016.5|168007.864103143| 237816.0| 217.0| 0| 0| 0| 0| 168| 104|10.4885349824|6.2931209894| 0| 458| 73.7777777778|159.3783060659|25401.4444444444| 0| 0| 0| 1| 0| 0| 0| 0| 0| 83.0| 41.2| 152.6666666667| 0| 0| 0| 0| 0| 0| 5| 206| 3| 458| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
476616| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1414.1363277775|16.7850009232| 68088.0|115553.073301117| 237285.0| 8.0| 476616.0| 119154.0|136576.972942001| 237558.0| 8.0| 237660.0| 118830.0|167563.093937776| 237315.0| 345.0| 0| 0| 0| 0| 168| 104| 10.490625577|6.2943753462| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
477161| 5| 3| 211| 463.0| 211| 0| 42.2| 94.3620686505| 463| 0| 154.3333333333| 267.3131746348|1412.5211406632|16.7658295628|68165.8571428572|116324.530541969| 238504.0| 12.0| 477161.0| 119290.25|137729.560342905| 238719.0| 12.0| 238618.0| 119309.0|168449.805841384| 238421.0| 197.0| 0| 0| 0| 0| 168| 104|10.4786434767| 6.287186086| 0| 463| 74.8888888889|161.4058893322|26051.8611111111| 0| 0| 0| 1| 0| 0| 0| 0| 0| 84.25| 42.2| 154.3333333333| 0| 0| 0| 0| 0| 0| 5| 211| 3| 463| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
474670| 5| 3| 214| 466.0| 214| 0| 42.8| 95.703709437| 466| 0| 155.3333333333| 269.0452254424|1432.5742094508|16.8538142288| 67810.0|115371.262364883| 236717.0| 15.0| 474670.0| 118667.5|136423.221362787| 236894.0| 55.0| 236992.0| 118496.0| 167297.22178805| 236793.0| 199.0| 0| 0| 0| 0| 168| 104| 10.533633893|6.3201803358| 0| 466| 75.5555555556|162.6246530443|26446.7777777778| 0| 0| 0| 1| 0| 0| 0| 0| 0| 85.0| 42.8| 155.3333333333| 0| 0| 0| 0| 0| 0| 5| 214| 3| 466| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
476608| 5| 3| 209| 461.0| 209| 0| 41.8| 93.4676414595| 461| 0| 153.6666666667| 266.1584740964|1405.7674231234|16.7852826642|68086.8571428572|116165.357657993| 238154.0| 21.0| 476608.0| 119152.0|137536.399945614| 238349.0| 28.0| 238442.0| 119221.0|168305.556058022| 238231.0| 211.0| 0| 0| 0| 0| 168| 104|10.4908016651|6.2944809991| 0| 461| 74.4444444444|160.5942955954|25790.5277777778| 0| 0| 0| 1| 0| 0| 0| 0| 0| 83.75| 41.8| 153.6666666667| 0| 0| 0| 0| 0| 0| 5| 209| 3| 461| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
479249| 5| 3| 215| 467.0| 215| 0| 43.0| 96.1509230325| 467| 0| 155.6666666667| 269.6225757116|1423.0598290242|16.6927839182|68464.1428571429|116522.185511928| 239061.0| 18.0| 479249.0| 119812.25|137797.592674606| 239270.0| 21.0| 239238.0| 119619.0|168903.768394906| 239052.0| 186.0| 0| 0| 0| 0| 168| 104|10.4329899489|6.2597939693| 0| 467| 75.7777777778|163.0312683029|26579.1944444444| 0| 0| 0| 1| 0| 0| 0| 0| 0| 85.25| 43.0| 155.6666666667| 0| 0| 0| 0| 0| 0| 5| 215| 3| 467| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
475967| 5| 3| 215| 467.0| 215| 0| 43.0| 96.1509230325| 467| 0| 155.6666666667| 269.6225757116|1432.8724470394|16.8078879418|67995.2857142857|115929.081086548| 237703.0| 16.0| 475967.0| 118991.75|137195.538016305| 237904.0| 21.0| 237915.0| 118957.5|167975.337191208| 237734.0| 181.0| 0| 0| 0| 0| 168| 104|10.5049299636|6.3029579782| 0| 467| 75.7777777778|163.0312683029|26579.1944444444| 0| 0| 0| 1| 0| 0| 0| 0| 0| 85.25| 43.0| 155.6666666667| 0| 0| 0| 0| 0| 0| 5| 215| 3| 467| 14480| 219| 1| 32| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|Benign|
8211;+——————-+————-+————+————+—————+—————+—————+—————+—————-+—————+—————+—————+—————-+—————+—————+————-+—————-+—————-+————+————+————+—————-+—————-+———–+———–+———–+—————-+—————-+———–+———–+————-+————-+————-+————-+————–+————–+————-+————+———–+———–+————–+————–+—————-+————+————+————+————+————+————+————–+————+————-+————–+—————-+—————-+————–+————–+—————-+————–+————–+—————-+—————-+—————-+—————-+—————-+—————–+—————–+—————–+—————-+———–+———-+———-+———-+————+————–+———–+———–+——+

In [ ]:

# The total number of attacks per label
IDS_df.select(‘Label’).groupBy(‘Label’).count().orderBy(‘count’, ascending=False).show()

+——————–+——-+
| Label| count|
+——————–+——-+
| Benign|6870186|
| DDOS attack-HOIC| 686012|
| DoS attacks-Hulk| 461912|
| Bot| 286191|
| FTP-BruteForce| 193360|
| SSH-Bruteforce| 187589|
|DoS attacks-SlowH…| 139890|
|DoS attacks-Golde…| 83016|
| Infilteration| 68871|
|DoS attacks-Slowl…| 21980|
|DDOS attack-LOIC-UDP| 1730|
| Brute Force -Web| 611|
| Brute Force -XSS| 230|
| SQL Injection| 87|
| Label| 34|
| 0| 1|
+——————–+——-+

In [120]:

IDS_df2 = IDS_df.withColumnRenamed(“Tot Fwd Pkts”,”tot_fw_pk”).withColumnRenamed(“Idle Max”,”idl_max”) \
.withColumnRenamed(“dst port”,”dst_port”).withColumnRenamed(“Idle Min”,”idl_min”) \
.withColumnRenamed(“TotLen Fwd Pkts”,”tot_l_fw_pkt”).withColumnRenamed(“Flow Duration”,”fl_dur”) \
.withColumnRenamed(“Flow Byts/s”,”fl_byt_s”).withColumnRenamed(“Fwd PSH Flags”,”fw_psh_flag”) \
.withColumnRenamed(“Active Max”,”atv_max”).withColumnRenamed(“Active Min”,”atv_min”) \
.withColumnRenamed(“Pkt Size Avg”,”pkt_size_avg”).withColumnRenamed(“Fwd Seg Size Avg”,”fw_seg_avg”) \
.withColumnRenamed(“Bwd Seg Size Avg”,”bw_seg_avg”)

Task 1: Spark SQL [30 marks]¶

In [72]:

IDS_df2.createOrReplaceTempView(“IDS”)

In [10]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [11]:

# Pramod Kumar Gouda u2002425
# Query 1 [Briefly explain]: Returns a set of objects with duplicate elements eliminated and it used for collection

sqlDF=spark.sql(“SELECT Protocol, collect_set(tot_fw_pk) as totalfwdpkts FROM IDS WHERE Protocol IS NOT NULL GROUP BY Protocol “)
sqlDF.show()

+——–+——————–+
|Protocol| totalfwdpkts|
+——–+——————–+
| 6|[2207, 356, 1982,…|
| 17|[97718, 95024, 15…|
| 0|[110, 52, 387, 13…|
+——–+——————–+

In [12]:

# Pramod Kumar Gouda u2002425
# Query 2 [Briefly explain]: Selecting the number of protocol based on there type
sqlDF=spark.sql(“SELECT Protocol, count(*) FROM IDS GROUP BY Protocol”)
sqlDF.show()

+——–+——–+
|Protocol|count(1)|
+——–+——–+
| null| 34|
| 6| 6976276|
| 17| 1919793|
| 0| 105597|
+——–+——–+

In [13]:

pandas_df=sqlDF.toPandas()
pandas_df.sort_values(by=’count(1)’,ascending=False).plot(x=’Protocol’,y=’count(1)’,kind=’bar’)

Out[13]:

In [56]:

# Virender Yadav u2002208
# Query 1 [Briefly explain]: Number of forwarded packets by there count
sqlDF=spark.sql(“SELECT tot_fw_pk, count(*) FROM IDS GROUP BY tot_fw_pk HAVING COUNT(tot_fw_pk) > 50 ” )
sqlDF.show()

+———+——–+
|tot_fw_pk|count(1)|
+———+——–+
| 148| 65|
| 31| 2440|
| 85| 154|
| 137| 82|
| 65| 664|
| 53| 928|
| 133| 84|
| 78| 268|
| 155| 69|
| 108| 203|
| 34| 1945|
| 126| 101|
| 115| 190|
| 101| 167|
| 81| 202|
| 28| 4640|
| 76| 215|
| 26| 6940|
| 27| 5227|
| 44| 1018|
+———+——–+
only showing top 20 rows

In [52]:

# Virender Yadav u2002208
# Query 2 [Briefly explain]: finding the average flow duration of packets
sqlDF=spark.sql(“SELECT avg(fl_dur) from IDS”)
sqlDF.show()

+——————–+
| avg(fl_dur)|
+——————–+
|1.0503308551024554E7|
+——————–+

In [23]:

# Nishitha Angali u2001782
# Query 1 [Briefly explain]: To find the count of maximum time a flow was idle before becomming active
# and minimum time a flow was idle before becomming active with conditions given as 0 in both cases
sqlDF=spark.sql(“SELECT count(idl_max),count(idl_min) FROM IDS where idl_max = 0 AND idl_min = 0”)
sqlDF.show()

+————–+————–+
|count(idl_max)|count(idl_min)|
+————–+————–+
| 7925325| 7925325|
+————–+————–+

In [26]:

# Nishitha Angali u2001782
# Query 2 [Briefly explain]: Counting the number of idel max group whose count is greater than 20
sqlDF=spark.sql(“SELECT idl_max,count(*) FROM IDS GROUP BY idl_max HAVING COUNT(idl_max) > 20 “)
sqlDF.show()

+———–+——–+
| idl_max|count(1)|
+———–+——–+
|5.6320958E7| 21|
| 1.001003E7| 29|
|1.0004066E7| 60|
|1.0014804E7| 23|
|1.0014585E7| 41|
|1.0014137E7| 23|
|5.6319468E7| 24|
|1.0001233E7| 24|
| 1.000114E7| 26|
|1.0010131E7| 31|
|5.6318028E7| 22|
|1.0014601E7| 38|
| 1.001432E7| 24|
|1.0003658E7| 35|
| 6.06E7| 36|
| 1.66E7| 35|
| 1.000975E7| 26|
|1.0001486E7| 29|
| 1.001427E7| 21|
|1.0013879E7| 29|
+———–+——–+
only showing top 20 rows

In [27]:

# Meetkumar Rasikbhai Patel u2001677
# Query 1 [Briefly explain]: Counting the number of destination ports by there type
sqlDF=spark.sql(“SELECT dst_port, count(*) from IDS GROUP BY dst_port “)
sqlDF.show()

+——–+——–+
|dst_port|count(1)|
+——–+——–+
| 38422| 29|
| 40386| 35|
| 35982| 39|
| 3997| 5|
| 1829| 4|
| 51415| 204|
| 26706| 4|
| 15846| 3|
| 51607| 184|
| 49308| 56|
| 50348| 216|
| 49855| 222|
| 50353| 206|
| 51393| 206|
| 51123| 210|
| 51595| 188|
| 63964| 26|
| 64519| 29|
| 57020| 69|
| 50223| 223|
+——–+——–+
only showing top 20 rows

In [53]:

# Meetkumar Rasikbhai Patel u2001677
# Query 2 [Briefly explain]: summing up the total length of forwarded packets
sqlDF=spark.sql(“SELECT sum(tot_l_fw_pkt) from IDS”)
sqlDF.show()

+—————–+
|sum(tot_l_fw_pkt)|
+—————–+
| 10422158811|
+—————–+

In [73]:

# Maulik Bhikhabhai Padhiyar u2002324
# Query 1 [Briefly explain]: selecting the different types of PSH flag
sqlDF=spark.sql(“SELECT DISTINCT fw_psh_flag from IDS “)
sqlDF.show()

+———–+
|fw_psh_flag|
+———–+
| null|
| 1|
| 0|
+———–+

In [68]:

# Maulik Bhikhabhai Padhiyar u2002324
# Query 2 [Briefly explain]: average byte rate which is transfered per second
sqlDF=spark.sql(“SELECT count(fl_byt_s) from IDS where fl_byt_s > 0”)
sqlDF.show()

+—————+
|count(fl_byt_s)|
+—————+
| 5781032|
+—————+

Task 2 – Part1: PySpark [45 marks]¶

In [74]:

IDS_df2 = IDS_df2.na.drop()

In [121]:

# Pramod Kumar Gouda u2002425
# Analytical method 1: We are converting required columns from string to float to find skewness,is a measure of the
# asymmetry of the data around sample mean

from pyspark.sql.functions import col
selected_features = [‘tot_fw_pk’,’idl_max’,’atv_max’,’atv_min’,’idl_min’,’tot_l_fw_pkt’,’pkt_size_avg’,’fw_seg_avg’,’bw_seg_avg’]
IDS_selected_features_df = IDS_df2.select(*(col(c).cast(“float”).alias(c) for c in selected_features))
IDS_selected_features_df.show()

+———+———–+——-+——-+———–+————+————+———-+———-+
|tot_fw_pk| idl_max|atv_max|atv_min| idl_min|tot_l_fw_pkt|pkt_size_avg|fw_seg_avg|bw_seg_avg|
+———+———–+——-+——-+———–+————+————+———-+———-+
| 3.0| 5.632096E7| 0.0| 0.0| 5.632076E7| 0.0| 0.0| 0.0| 0.0|
| 3.0|5.6320816E7| 0.0| 0.0|5.6320652E7| 0.0| 0.0| 0.0| 0.0|
| 3.0|5.6319524E7| 0.0| 0.0|5.6319096E7| 0.0| 0.0| 0.0| 0.0|
| 15.0| 0.0| 0.0| 0.0| 0.0| 1239.0| 140.48| 82.6| 227.3|
| 14.0| 0.0| 0.0| 0.0| 0.0| 1143.0| 134.08| 81.64286| 200.81818|
| 16.0| 0.0| 0.0| 0.0| 0.0| 1239.0| 125.42857| 77.4375| 189.41667|
| 3.0|5.6320384E7| 0.0| 0.0|5.6320096E7| 0.0| 0.0| 0.0| 0.0|
| 3.0|5.6320664E7| 0.0| 0.0|5.6320576E7| 0.0| 0.0| 0.0| 0.0|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 220.0| 86.5| 44.0| 157.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 220.0| 86.5| 44.0| 157.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 209.0| 83.75| 41.8| 153.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 206.0| 83.0| 41.2| 152.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 211.0| 84.25| 42.2| 154.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 214.0| 85.0| 42.8| 155.33333|
| 5.0| 0.0| 0.0| 0.0| 0.0| 209.0| 83.75| 41.8| 153.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 215.0| 85.25| 43.0| 155.66667|
| 5.0| 0.0| 0.0| 0.0| 0.0| 215.0| 85.25| 43.0| 155.66667|
+———+———–+——-+——-+———–+————+————+———-+———-+
only showing top 20 rows

In [81]:

from pyspark.sql import functions as f
IDS_selected_features_df.select(f.skewness(IDS_selected_features_df[‘tot_fw_pk’])).show()

+——————-+
|skewness(tot_fw_pk)|
+——————-+
| 77.74947867361696|
+——————-+

In [83]:

# Pramod Kumar Gouda u2002425
# Analytical method 2: I am finding the correlation between two columns where if one column increases its corelated
# with other column and if another column value decreases
IDS_selected_features_df.stat.corr(“atv_max”,”atv_min”)

Out[83]:

0.7414425772390605

In [84]:

# Pramod Kumar Gouda u2002425
# Analytical method 3: kernel density estimate on a pyspark dataframe column and use it for
# creating a new column with the estimates
from pyspark.mllib.stat import KernelDensity
dat_rdd = IDS_selected_features_df.select(“tot_fw_pk”).rdd
dat_rdd_data = dat_rdd.map(lambda x: x[0])
kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0,14.0])

Out[84]:

array([0.0097561 , 0.01106945])

In [86]:

# Virender Yadav u2002208
# Analytical method 1: finding the kurtosis and is about tails of the distribution,measures of outliners
IDS_selected_features_df.select(f.kurtosis(IDS_selected_features_df[‘atv_min’])).show()

+——————+
| kurtosis(atv_min)|
+——————+
|4022.4866270803145|
+——————+

In [91]:

# Virender Yadav u2002208
# Analytical method 2: find the percentile of a 80th %
IDS_df2.groupby(‘label’).agg(f.expr(‘percentile(atv_max, array(0.80))’)[0].alias(‘%80’)).show()

+——————–+———+
| label| %80|
+——————–+———+
| SSH-Bruteforce| 0.0|
| Label| null|
| Infilteration| 0.0|
| 0| 0.0|
| SQL Injection| 0.0|
|DoS attacks-Slowl…|7678920.2|
| Benign| 0.0|
|DoS attacks-SlowH…| 0.0|
| Bot| 0.0|
|DoS attacks-Golde…| 441.0|
| Brute Force -XSS| 0.0|
| FTP-BruteForce| 0.0|
|DDOS attack-LOIC-UDP| 0.0|
| DoS attacks-Hulk| 0.0|
| Brute Force -Web|3999879.0|
| DDOS attack-HOIC| 0.0|
+——————–+———+

In [95]:

# Virender Yadav u2002208
# Analytical method 3: calculating standard deviation its value is how far from the normal ,squareroot of variance

IDS_df2.agg(f.stddev(“atv_max”)).show()

+——————–+
|stddev_samp(atv_max)|
+——————–+
| 1749057.5452774959|
+——————–+

In [85]:

# Nishitha Angali u2001782
# Analytical method 1: To find skewness,is a measure of the asymmetry of the data around sample mean
IDS_selected_features_df.select(f.skewness(IDS_selected_features_df[‘idl_max’])).show()

+—————–+
|skewness(idl_max)|
+—————–+
|992.0339103023476|
+—————–+

In [97]:

# Nishitha Angali u2001782
# Analytical method 2: finding the correlation between two columns where if one column increases its corelated
# with other column and if another column value decreases
IDS_selected_features_df.stat.corr(“idl_max”,”idl_min”)

Out[97]:

0.33075364446687444

In [101]:

# Nishitha Angali u2001782
# Analytical method 3: find the percentile of a 75th %
IDS_df2.groupby(‘protocol’).agg(f.expr(‘percentile(idl_max, array(0.75))’)[0].alias(‘%75’)).show()

+——–+——-+
|protocol| %75|
+——–+——-+
| null| null|
| 6| 0.0|
| 17| 0.0|
| 0|5.632E7|
+——–+——-+

In [116]:

# Meetkumar Rasikbhai Patel u2001677
# Analytical method 1: finding the kurtosis and is about tails of the distribution,measures of outliners
IDS_selected_features_df.select(f.kurtosis(IDS_selected_features_df[‘tot_l_fw_pkt’])).show()

+———————-+
|kurtosis(tot_l_fw_pkt)|
+———————-+
| 1529504.304325319|
+———————-+

In [113]:

# Meetkumar Rasikbhai Patel u2001677
# Analytical method 2: calucating the average value for the column
IDS_df2.select(f.mean(“tot_l_fw_pkt”)).show()

+——————+
| avg(tot_l_fw_pkt)|
+——————+
|1157.8033234070226|
+——————+

In [122]:

# Meetkumar Rasikbhai Patel u2001677
# Analytical method 3: finding the correlation between two columns where if one column increases its corelated
# with other column and if another column value decreases

IDS_selected_features_df.stat.corr(“fw_seg_avg”,”bw_seg_avg”)

Out[122]:

0.37286275767327126

In [119]:

# Maulik Bhikhabhai Padhiyar u2002324
# Analytical method 1: To find skewness,is a measure of the asymmetry of the data around sample mean
IDS_selected_features_df.select(f.skewness(IDS_selected_features_df[‘pkt_size_avg’])).show()

+———————-+
|skewness(pkt_size_avg)|
+———————-+
| 3.4566237116943674|
+———————-+

In [123]:

# Maulik Bhikhabhai Padhiyar u2002324
# Analytical method 2:
IDS_df2.agg(f.stddev(“idl_min”)).show()

+——————–+
|stddev_samp(idl_min)|
+——————–+
| 8.435726565284234E7|
+——————–+

In [124]:

# Maulik Bhikhabhai Padhiyar u2002324
# Analytical method 3: Obtaining the maximum value in column
IDS_df2.agg(f.max(“idl_max”)).show()

+————+
|max(idl_max)|
+————+
| 9.79781E11|
+————+

Task 2 – Part2: PySpark [15 marks]¶

In [ ]:

# pramod Kumar Gouda u2002425
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.feature import Word2Vec
wv_rdd = IDS_df2.rdd
inp = wv_rdd.map(lambda row: row.split(” “))
word2vec = Word2Vec()
model = word2vec.fit(inp)
synonyms = model.findSynonyms(‘Benign’, 5)
for word, cosine_distance in synonyms:
print(“{}: {}”.format(word, cosine_distance))

In [ ]:

# Virender Yadav u2002208
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
# Load and parse the data
wv_rdd = IDS_df2.rdd
parsedData = wv_rdd.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(‘ ‘)]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)
# Output topics. Each is a distribution over words (matching word count vectors)
print(“Learned topics (as distributions over vocab of ” + str(ldaModel.vocabSize())
+ ” words):”)
topics = ldaModel.topicsMatrix()
for topic in range(3):
print(“Topic ” + str(topic) + “:”)
for word in range(0, ldaModel.vocabSize()):
print(” ” + str(topics[word][topic]))

In [ ]:

# Nishitha Angali u2001782
# Machine Learning Technique:
# What to achieve:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Load and parse the data
wv_rdd = IDS_df2.rdd
parsedData = wv_rdd.map(lambda line: array([float(x) for x in line.split(‘,’)]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode=”random”)
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point – center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print(“Within Set Sum of Squared Error = ” + str(WSSSE))

In [ ]:

# Meetkumar Rasikbhai Patel u2001677
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors
data = sc.textFile(“data/mllib/kmeans_data.txt”)
parsedData = data.map(lambda x: [float(t) for t in x.split(” “)])
# Create weight vector.
transformingVector = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(transformingVector)
# Batch transform
transformedData = transformer.transform(parsedData)
# Single-row transform
transformedData2 = transformer.transform(parsedData.first())

In [ ]:

# Maulik Bhikhabhai Padhiyar u2002324
# Machine Learning Technique:
# What to achieve:
from pyspark.mllib.feature import Normalizer
from pyspark.mllib.util import MLUtils
data = MLUtils.loadLibSVMFile(sc, “data/mllib/sample_libsvm_data.txt”)
labels = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
normalizer1 = Normalizer()
normalizer2 = Normalizer(p=float(“inf”))
# Each sample in data1 will be normalized using $L^2$ norm.
data1 = labels.zip(normalizer1.transform(features))
# Each sample in data2 will be normalized using $L^\infty$ norm.
data2 = labels.zip(normalizer2.transform(features))

Convert ipynb to HTML for Turnitin submission [10 marks]¶

In [125]:

# install nbconvert
!pip install nbconvert

Requirement already satisfied: nbconvert in /usr/local/lib/python3.6/dist-packages (5.6.1)
Requirement already satisfied: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert) (3.2.1)
Requirement already satisfied: jinja2>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (2.11.2)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.6.0)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (4.3.3)
Requirement already satisfied: nbformat>=4.4 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (5.0.8)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.8.4)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from nbconvert) (2.6.1)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.3)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/dist-packages (from nbconvert) (4.7.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert) (1.4.3)
Requirement already satisfied: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert) (0.4.4)
Requirement already satisfied: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert) (0.5.1)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert) (1.15.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert) (20.7)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.4->nbconvert) (1.1.1)
Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->nbconvert) (4.4.2)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->nbconvert) (0.2.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.4->nbconvert) (2.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->bleach->nbconvert) (2.4.7)

In [ ]:

# convert ipynb to html
# file name: Group115_CN7031_CN7031.ipynb
!jupyter nbconvert –to html Group115_CN7031.ipynb

What Will You Get?

We provide professional writing services to help you score straight A’s by submitting custom written assignments that mirror your guidelines.

Premium Quality

Get result-oriented writing and never worry about grades anymore. We follow the highest quality standards to make sure that you get perfect assignments.

Experienced Writers

Our writers have experience in dealing with papers of every educational level. You can surely rely on the expertise of our qualified professionals.

On-Time Delivery

Your deadline is our threshold for success and we take it very seriously. We make sure you receive your papers before your predefined time.

24/7 Customer Support

Someone from our customer support team is always here to respond to your questions. So, hit us up if you have got any ambiguity or concern.

Complete Confidentiality

Sit back and relax while we help you out with writing your papers. We have an ultimate policy for keeping your personal and order-related details a secret.

Authentic Sources

We assure you that your document will be thoroughly checked for plagiarism and grammatical errors as we use highly authentic and licit sources.

Moneyback Guarantee

Still reluctant about placing an order? Our 100% Moneyback Guarantee backs you up on rare occasions where you aren’t satisfied with the writing.

Order Tracking

You don’t have to wait for an update for hours; you can track the progress of your order any time you want. We share the status after each step.

Order Now Talk to Us

Areas of Expertise

Although you can leverage our expertise for any writing task, we have a knack for creating flawless papers for the following document types.

Essay

Thesis

Presentation

Dissertation

Term Paper

Research Paper

Book Review

Assignment

Report

Case Study

Letter

Article

Coursework

Speech

Q & A

Critical Thinking

Areas of Expertise

Although you can leverage our expertise for any writing task, we have a knack for creating flawless papers for the following document types.

Essay

Thesis

Presentation

Dissertation

Term Paper

Research Paper

Book Review

Assignment

Report

Case Study

Letter

Article

Coursework

Speech

Q & A

Critical Thinking

Trusted Partner of 9650+ Students for Writing

From brainstorming your paper's outline to perfecting its grammar, we perform every step carefully to make your paper worthy of A grade.

Preferred Writer

Hire your preferred writer anytime. Simply specify if you want your preferred expert to write your paper and we’ll make that happen.

Grammar Check Report

Get an elaborate and authentic grammar check report with your work to have the grammar goodness sealed in your document.

One Page Summary

You can purchase this feature if you want our writers to sum up your paper in the form of a concise and well-articulated summary.

Plagiarism Report

You don’t have to worry about plagiarism anymore. Get a plagiarism report to certify the uniqueness of your work.

Free Features $66FREE

Most Qualified Writer $10FREE
Plagiarism Scan Report $10FREE
Unlimited Revisions $08FREE
Paper Formatting $05FREE
Cover Page $05FREE
Referencing & Bibliography $10FREE
Dedicated User Area $08FREE
24/7 Order Tracking $05FREE
Periodic Email Alerts $05FREE

Our Services

Join us for the best experience while seeking writing assistance in your college life. A good grade is all you need to boost up your academic excellence and we are all about it.

On-time Delivery
24/7 Order Tracking
Access to Authentic Sources

Academic Writing

We create perfect papers according to the guidelines.

Professional Editing

We seamlessly edit out errors from your papers.

Thorough Proofreading

We thoroughly read your final draft to identify errors.

Thorough Proofreading

We thoroughly read your final draft to identify errors.

Delegate Your Challenging Writing Tasks to Experienced Professionals

Work with ultimate peace of mind because we ensure that your academic work is our responsibility and your grades are a top concern for us!

Check Out Our Sample Work

Dedication. Quality. Commitment. Punctuality

It May Not Be Much, but It’s Honest Work!

Here is what we have achieved so far. These numbers are evidence that we go the extra mile to make your college journey successful.

0+

Happy Clients

0+

Words Written This Week

0+

Ongoing Orders

0%

Customer Satisfaction Rate

Process as Fine as Brewed Coffee

We have the most intuitive and minimalistic process so that you can easily place an order. Just follow a few steps to unlock success.

Call Us +1 (877) 657-8180 Discuss Order Details Now

See How We Helped 9000+ Students Achieve Success

We Analyze Your Problem and Offer Customized Writing

We understand your guidelines first before delivering any writing service. You can discuss your writing needs and we will have them evaluated by our dedicated team.

Clear elicitation of your requirements.
Customized writing as per your needs.

We Mirror Your Guidelines to Deliver Quality Services

We write your papers in a standardized way. We complete your work in such a way that it turns out to be a perfect description of your guidelines.

Proactive analysis of your writing.
Active communication to understand requirements.

We Handle Your Writing Tasks to Ensure Excellent Grades

We promise you excellent grades and academic excellence that you always longed for. Our writers stay in touch with you via email.

Thorough research and analysis for every order.
Deliverance of reliable writing service to improve your grades.

Place an Order Start Chat Now

Need two Big Data Query & Analysis using Spark SQL, three queries Advanced Analytics using PySpark

Big Data Analytics [CN7031] CRWK 2020-21¶

Group ID: [115]¶

Initiate and Configure Spark¶

Load Data¶

Task 1: Spark SQL [30 marks]¶

Task 2 – Part1: PySpark [45 marks]¶

Task 2 – Part2: PySpark [15 marks]¶

Convert ipynb to HTML for Turnitin submission [10 marks]¶

What Will You Get?

Premium Quality

Experienced Writers

On-Time Delivery

24/7 Customer Support

Complete Confidentiality

Authentic Sources

Moneyback Guarantee

Order Tracking

Areas of Expertise

Essay

Thesis

Presentation

Dissertation

Term Paper

Research Paper

Book Review

Assignment

Report

Case Study

Letter

Article

Coursework

Speech

Q & A

Critical Thinking

Areas of Expertise

Essay

Thesis

Presentation

Dissertation

Term Paper

Research Paper

Book Review

Assignment

Report

Case Study

Letter

Article

Coursework

Speech

Q & A

Critical Thinking

Trusted Partner of 9650+ Students for Writing

Preferred Writer

Grammar Check Report

One Page Summary

Plagiarism Report

Free Features $66FREE

Our Services

Academic Writing

Professional Editing

Thorough Proofreading

Thorough Proofreading

Delegate Your Challenging Writing Tasks to Experienced Professionals

Check Out Our Sample Work

It May Not Be Much, but It’s Honest Work!

0+

Happy Clients

0+

Words Written This Week

0+

Ongoing Orders

0%

Customer Satisfaction Rate

Process as Fine as Brewed Coffee

Share Your Requirements

Place Order & Deposit Funds

Release Payment to Your Writer

See How We Helped 9000+ Students Achieve Success

We Analyze Your Problem and Offer Customized Writing