PySpark Batch Gradient Descent

This homework requires implementing Batch Gradient Descent to fit a line to a two-dimensional data set. You will implement a set of Spark jobs that learn the parameters of such a line from the New York City taxi trip reports for the year 2013.

All of the homework requirements are in the uploaded file. Please write the PySpark code only for the small data set that I uploaded; I will run the final code on my large data set. I only need a .py file covering all of the tasks.

Assignment 3

MET CS 777 – Big Data Analytics
Batch Gradient Descent
(20 points)

GitHub Classroom Invitation Link
https://classroom.github.com/a/DEne63c6

    1 Description
    In this assignment you will implement Batch Gradient Descent to fit a line to a two-dimensional
    data set. You will implement a set of Spark jobs that learn the parameters of such a line from
    the New York City taxi trip reports for the year 2013. The data set was released under FOIL
    (the Freedom of Information Law) and made public by Chris Whong
    (https://chriswhong.com/open-data/foil_nyc_taxi/). See Assignment 1 for details about this data set.

    We would like to train a linear model that relates travel distance in miles to fare amount (the
    money that is paid to the taxi).

    2 Taxi Data Set – Same data set as Assignment 1
    This is the same data set that was used for Assignment 1. Please have a look at the table
    description there.

    The data set is in Comma-Separated Values (CSV) format. When you read a line and split it
    on the comma character ",", you get a string array of length 17. With index numbers starting
    from zero, this assignment needs index 5, trip distance (in miles), and index 11, fare amount
    (in dollars), as stated in the following table.

    Index             Field           Description
    5  (our X-axis)   trip distance   trip distance in miles
    11 (our Y-axis)   fare amount     fare amount in dollars

    Table 1: Taxi Data Set fields


    You can use the following PySpark code to clean up the data and get the required fields.

    import sys
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Read the input file (path given as the first command-line argument)
    lines = sc.textFile(sys.argv[1])
    taxilines = lines.map(lambda x: x.split(','))

    # Exception handling and removing wrong data lines
    def isfloat(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    # Keep only lines that have 17 values and a numeric, non-zero distance and fare
    def correctRows(p):
        if len(p) == 17:
            if isfloat(p[5]) and isfloat(p[11]):
                if float(p[5]) != 0 and float(p[11]) != 0:
                    return True
        return False

    # Cleaning up data
    taxilinesCorrected = taxilines.filter(correctRows)

    • In addition to the above filtering, you should remove all rides that have a total amount
    larger than 600 USD or less than 1 USD.

    • You can preprocess the data, clean it, and store it in your own cluster storage to avoid
    repeating this computation in each run.
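The extra total-amount filter can be sketched as a plain Python predicate that you pass to `filter`. This is only a sketch under stated assumptions: the handout gives the positions of trip distance (5) and fare amount (11) but not of the total amount, so `total_index=16` (the last column) is an assumption to check against the Assignment 1 table, and the names `keep_ride` and `to_xy` are hypothetical.

```python
def is_float(value):
    """Return True if value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def keep_ride(p, total_index=16, low=1.0, high=600.0):
    """Keep rows with 17 fields, non-zero distance and fare, and low <= total <= high.

    total_index=16 is an ASSUMPTION (last column); verify it in Assignment 1.
    """
    if len(p) != 17:
        return False
    if not (is_float(p[5]) and is_float(p[11]) and is_float(p[total_index])):
        return False
    if float(p[5]) == 0 or float(p[11]) == 0:
        return False
    return low <= float(p[total_index]) <= high


def to_xy(p):
    """Project a cleaned row to (trip distance, fare amount) as floats."""
    return float(p[5]), float(p[11])
```

On an RDD of split lines, these functions would be applied as `.filter(keep_ride).map(to_xy)`; the resulting pairs can then be saved once and reused across runs.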

    3 Obtaining the Dataset
    You can find the data sets here:

    Small data set
    https://metcs777.s3.amazonaws.com/taxi-data-sorted-small.csv.bz2

    Amazon AWS
    Small Data Set s3://metcs777/taxi-data-sorted-small.csv.bz2
    Large Data Set s3://metcs777/taxi-data-sorted-large.csv.bz2

    Table 2: Data set on Amazon AWS – URLs

    Google Cloud
    Small Data Set gs://metcs777/taxi-data-sorted-small.csv.bz2
    Large Data Set gs://metcs777/taxi-data-sorted-large.csv.bz2

    Table 3: Data set on Google Cloud Storage – URLs


    4 Assignment Tasks
    4.1 Task 1: Simple Linear Regression (4 points)
    We want to fit a simple line to our data (distance, money). Consider the Simple Linear Regression model given in
    equation (1). The slope m of the line and its y-intercept b are calculated using equations (2) and (3).

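    $$Y = mX + b \tag{1}$$

    $$\hat{m} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \tag{2}$$

    $$\hat{b} = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \tag{3}$$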

    Implement a PySpark job that calculates the exact values of the parameters m and b, where
    m is the slope of the line and b is its y-intercept.
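One way to approach the closed-form solution is to map every (x, y) pair to a tuple of partial sums and add those tuples up, so the data is scanned only once. The sketch below uses Python's built-in `map` and `functools.reduce` on a small made-up list as a local stand-in for an RDD; the function names are hypothetical.

```python
from functools import reduce


def to_sums(point):
    """Map one (x, y) pair to the partial sums (n, sum x, sum y, sum xy, sum x^2)."""
    x, y = point
    return (1, x, y, x * y, x * x)


def add_sums(a, b):
    """Combine two partial-sum tuples element by element."""
    return tuple(u + v for u, v in zip(a, b))


def closed_form(sums):
    """Apply equations (2) and (3) to the aggregated sums."""
    n, sx, sy, sxy, sxx = sums
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    return m, b


# Local stand-in for an RDD of (distance, fare) pairs; the values are made up.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
m, b = closed_form(reduce(add_sums, map(to_sums, points)))
print(m, b)  # prints 2.0 1.0
```

With an RDD of pairs named, say, `xy`, the same two functions would be used as `xy.map(to_sums).reduce(add_sums)`, and `closed_form` would run on the driver.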

    Run your implementation on the large data set and report the computation time of your Spark
    job for this task. The cloud system reports the completion time of each job.

    Note on Task 1: Depending on your implementation, running this task on the large data set
    can take a long time; for example, on a cluster with 12 cores in total it takes more than
    30 minutes of computation time.

    4.2 Task 2 – Find the Parameters using Gradient Descent (8 points)
    In this task, you should implement batch gradient descent to find the optimal parameters for our
    Simple Linear Regression model.

    • You should load the data into Spark cluster memory as an RDD or DataFrame.

    • Start with all parameters set to 0.1 and do 100 iterations.

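    The cost function is then

    $$L(m, b) = \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2$$

    The partial derivatives used to update the parameters m and b are

    $$\frac{\partial L}{\partial m} = \frac{-2}{n} \sum_{i=1}^{n} x_i \left(y_i - (m x_i + b)\right)$$

    $$\frac{\partial L}{\partial b} = \frac{-2}{n} \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)$$

    The factor 1/n averages the gradient over the n data points.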

    Here is a list of important setup parameters:

    • Initialize all parameters with 0.1.

    • Initialize the learning rate to learningRate=0.0000001 and change it if the cost is not decreasing
    fast enough.

    • The maximum number of iterations should be 400.

    • You can also stop the training when the cost changes by less than 0.01 between iterations.

    • Print out the cost in each iteration.
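The setup parameters above can be put together in a loop like the following. It is a minimal local sketch, not a reference solution: a Python list stands in for the cached RDD, the function names are hypothetical, and the demo call uses a much larger learning rate (0.01) than the handout's default only because the toy data is tiny.

```python
from functools import reduce


def grad_terms(point, m, b):
    """Per-point contributions (x*r, r, r*r, 1) with residual r = y - (m*x + b)."""
    x, y = point
    r = y - (m * x + b)
    return (x * r, r, r * r, 1)


def add_terms(a, c):
    """Element-wise sum of two contribution tuples."""
    return tuple(u + v for u, v in zip(a, c))


def train(points, learning_rate=1e-7, max_iter=400, tol=0.01):
    """Batch gradient descent for y = m*x + b, starting from m = b = 0.1."""
    m, b = 0.1, 0.1
    costs = []
    for i in range(max_iter):
        sxr, sr, cost, n = reduce(add_terms, (grad_terms(p, m, b) for p in points))
        costs.append(cost)
        print(f"iteration {i}: cost={cost:.6f} m={m:.6f} b={b:.6f}")
        # Stop once the cost changes by less than tol between iterations.
        if len(costs) > 1 and abs(costs[-2] - cost) < tol:
            break
        # Gradients from the handout: -2/n * sum(x*r) and -2/n * sum(r).
        m -= learning_rate * (-2.0 / n) * sxr
        b -= learning_rate * (-2.0 / n) * sr
    return m, b, costs


# Made-up points on the line y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
m, b, costs = train(points, learning_rate=0.01)
```

On a cluster, the `reduce(...)` line would become `rdd.map(lambda p: grad_terms(p, m, b)).reduce(add_terms)`, so the current m and b travel to the workers inside the closure and only four numbers come back to the driver per iteration.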

    Run your implementation on the large data set and report the computation time of your Spark
    job for this task. Compare the computation time with that of the previous task.

    • Create a plot that shows the cost in each iteration.

    • Print out the model parameters in each iteration.

    Note: You might write gradient descent iteration code in PySpark that works perfectly
    on your laptop but does not run on the clusters (AWS/Google Cloud). The main reason is
    that on your laptop it runs in a single process, while on a cluster it runs in multiple
    shared-nothing processes. You need to reduce the results of all jobs/processes before you
    update the variables; otherwise each process will have its own copy of the variables.
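To illustrate the point of the note, the sketch below keeps no shared state: each partition folds its own points into a private accumulator, and the accumulators are merged afterwards. Two Python lists stand in for two partitions; the names `make_seq_op`, `comb_op`, and `ZERO` are hypothetical, chosen to match the argument roles of `RDD.aggregate(zeroValue, seqOp, combOp)`.

```python
from functools import reduce

ZERO = (0.0, 0.0, 0.0, 0)  # (sum of x*r, sum of r, sum of r*r, count)


def make_seq_op(m, b):
    """Build the function that folds one (x, y) point into a partition's accumulator."""
    def seq_op(acc, point):
        x, y = point
        r = y - (m * x + b)
        return (acc[0] + x * r, acc[1] + r, acc[2] + r * r, acc[3] + 1)
    return seq_op


def comb_op(a, c):
    """Merge the accumulators of two partitions."""
    return (a[0] + c[0], a[1] + c[1], a[2] + c[2], a[3] + c[3])


# Two lists stand in for two partitions held by separate worker processes.
partitions = [[(1.0, 3.0), (2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
m, b = 0.1, 0.1
partials = [reduce(make_seq_op(m, b), part, ZERO) for part in partitions]
sxr, sr, cost, n = reduce(comb_op, partials, ZERO)
print(sxr, sr, cost, n)
```

With a real RDD the last three lines would collapse into `rdd.aggregate(ZERO, make_seq_op(m, b), comb_op)`. The update of m and b then happens on the driver, using the merged result, rather than inside a function that runs on the workers.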

    4.3 Task 3 – Fit Multiple Linear Regression using Gradient Descent (8 points)
    We would like to build a Linear Regression model to predict the total amount of
    money that each driver makes per day. This is the amount that a driver gets per day (fare amount
    plus tip amount).

    We want to use the following 5 features for this linear regression model:

    • Total working time in hours that a driver worked per day (a float number)

    • Total travel distance in miles that a driver drove a taxi per day (a float number)

    • Total number of rides per day (an integer number)

    • Total amount of tolls per day (a float number; the toll amount indicates the number of rides over
    the NYC bridges or rides to the airport)

    • Total number of night rides between 1:00 AM and 6:00 AM per day
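The five features and the target can be built by keying every ride on (driver, day) and summing per key. The sketch below rests on several assumptions that this handout does not state: the column positions for driver ID, pickup time, trip time in seconds, tip, and tolls (only 5 and 11 are given here), the timestamp format, counting working time as the sum of trip durations, and treating 6:00 AM itself as outside the night window. Check each against the Assignment 1 table before relying on it.

```python
from datetime import datetime

# ASSUMED column positions (only 5 and 11 are stated in this handout).
DRIVER, PICKUP, TRIP_SECS, DISTANCE, FARE, TIP, TOLLS = 1, 2, 4, 5, 11, 14, 15


def to_keyed_features(p):
    """Map one ride to ((driver, date), (hours, miles, rides, tolls, night rides, income))."""
    # ASSUMED timestamp format, e.g. "2013-01-01 02:30:00".
    pickup = datetime.strptime(p[PICKUP], "%Y-%m-%d %H:%M:%S")
    night = 1 if 1 <= pickup.hour < 6 else 0
    value = (float(p[TRIP_SECS]) / 3600.0, float(p[DISTANCE]), 1,
             float(p[TOLLS]), night, float(p[FARE]) + float(p[TIP]))
    return (p[DRIVER], pickup.date().isoformat()), value


def add_values(a, c):
    """Element-wise sum, suitable as the function passed to reduceByKey."""
    return tuple(u + v for u, v in zip(a, c))


def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey, for trying the logic without a cluster."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out
```

On a cleaned RDD of split lines this would read `.map(to_keyed_features).reduceByKey(add_values)`; the first five entries of each value are the features and the last entry is the daily income to predict.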

    Consider the following points when you implement your model training using gradient descent:

    • Initialize all parameters with 0.1.

    • Initialize the learning rate to learningRate=0.0000001 and change it if the cost is not decreasing
    fast enough.

    • The maximum number of iterations should be 400.

    • Use NumPy arrays to calculate the gradients (vectorization).

    • You can stop the model training earlier than 400 iterations when the cost changes by less than
    0.01 between iterations.
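As a rough illustration of the vectorization point, the cost and gradients for a block of rows can be computed with a few NumPy array operations instead of a Python loop over features. This sketch assumes the same sum-of-squares cost and 1/n-scaled gradients as in Task 2, extended to a weight vector `w` and intercept `b`; the function name is hypothetical.

```python
import numpy as np


def cost_and_gradients(X, y, w, b):
    """Return (cost, grad_w, grad_b) for the model y ~ X @ w + b.

    X has shape (n, d), y has shape (n,), w has shape (d,), b is a scalar.
    """
    n = X.shape[0]
    residual = y - (X @ w + b)            # shape (n,)
    cost = float(residual @ residual)     # sum of squared errors
    grad_w = (-2.0 / n) * (X.T @ residual)
    grad_b = (-2.0 / n) * float(residual.sum())
    return cost, grad_w, grad_b


# Made-up data: 3 rows, 2 features.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([-0.5, -0.5, -0.5])
cost, grad_w, grad_b = cost_and_gradients(X, y, np.zeros(2), 0.0)
print(cost, grad_w, grad_b)
```

On a cluster, each partition would compute the unscaled sums `X.T @ residual`, `residual.sum()`, and `residual @ residual` for its own rows; the driver adds them up and applies the `-2/n` factor with the global row count.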

    Sub-tasks:

    • Implement the “Bold Driver” technique to dynamically change the learning rate (1 point of the 8
    points).

    • Print out the cost in each iteration.

    • Print out the model parameters in each iteration.

    • Create a plot of cost (y-axis) versus iteration number (x-axis) to show the cost per iteration.
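The Bold Driver rule is commonly described as: raise the learning rate slightly when the cost went down, and cut it sharply when the cost went up. A minimal sketch follows; the factors 1.05 and 0.5 are assumed, commonly quoted values that the handout does not specify, and the function name is hypothetical.

```python
def bold_driver(learning_rate, prev_cost, cost, grow=1.05, shrink=0.5):
    """Return the learning rate to use in the next iteration.

    grow and shrink are ASSUMED factors; the handout does not specify them.
    """
    if prev_cost is None:
        # First iteration: nothing to compare against yet.
        return learning_rate
    if cost < prev_cost:
        # Cost went down: be a little bolder.
        return learning_rate * grow
    # Cost went up or stalled: back off.
    return learning_rate * shrink
```

Inside the training loop you would call it once per iteration, for example `learning_rate = bold_driver(learning_rate, prev_cost, cost)`, before applying the parameter update. Some variants also discard the step that made the cost rise; that is left out here.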

    5 Important Considerations
    5.1 Machines to Use
    One thing to be aware of is that you can choose virtually any configuration for your EMR cluster:
    you can choose different numbers of machines and different configurations of those machines,
    and each is going to cost you differently!

    Pricing information is available at: http://aws.amazon.com/elasticmapreduce/pricing/

    Since this is real money, it makes sense to develop your code and run your jobs locally, on your
    laptop, using the small data set. Once things are working, you’ll then move to Amazon EMR. We
    are going to ask you to run your Spark jobs over the “real” data using 3 m5.xlarge machines
    (1 master node and 2 workers).

    As you can see on the EC2 price list, this costs around 50 cents per hour. That is not much, but IT
    WILL ADD UP QUICKLY IF YOU FORGET TO SHUT OFF YOUR MACHINES. Be very
    careful, and stop your machine as soon as you are done working. You can always come back and
    start your machine or create a new one easily when you begin your work again. Another thing to
    be aware of is that Amazon charges you when you move data around. To avoid such charges, do
    everything in the ”N. Virginia” region. That’s where data is, and that’s where you should put your
    data and machines.

    • You should document your code thoroughly.

    • Your code should run on a Unix-based operating system such as Linux or macOS.


    5.2 Academic Misconduct Regarding Programming
    In a programming class like our class, there is sometimes a very fine line between ”cheating”
    and acceptable and beneficial interaction between peers. Thus, it is very important that you fully
    understand what is and what is not allowed in terms of collaboration with your classmates. We
    want to be 100% precise, so that there can be no confusion.

    The rule on collaboration and communication with your classmates is very simple: you cannot
    transmit code to, or receive code from, anyone in the class in any way, whether visually (by
    showing someone your code), electronically (by emailing, posting, or otherwise sending someone
    your code), verbally (by reading code to someone), or in any other way we have not yet imagined.
    Any other collaboration is acceptable.

    The rule on collaboration and communication with people who are not your classmates (or
    your TAs or instructor) is also very simple: it is not allowed in any way, period. This disallows (for
    example) posting any questions of any nature to programming forums such as StackOverflow. As
    far as going to the web and using Google, we will apply the ”two line rule”. Go to any web page
    you like and do any search that you like. But you cannot take more than two lines of code from
    an external resource and actually include it in your assignment in any form. Note that changing
    variable names or otherwise transforming or obfuscating code you found on the web does not
    render the ”two line rule” inapplicable. It is still a violation to obtain more than two lines of code
    from an external resource and turn it in, whatever you do to those two lines after you first obtain
    them.

    Furthermore, you should cite your sources. Add a comment to your code that includes the
    URL(s) that you consulted when constructing your solution. This turns out to be very helpful when
    you’re looking at something you wrote a while ago and you need to remind yourself what you were
    thinking.

    5.3 Turnin
    Create a single document that has results for all three tasks. For each task, copy and paste the
    result that your last Spark job wrote to Amazon S3. Also for each task, for each Spark job you ran,
    include a screen shot of the Spark History.

    Please zip up all of your code and your document (use .zip only, please!), or else attach each
    piece of code as well as your document to your submission individually.

    Please have the latest version of your code on GitHub. Zip the files from GitHub and submit
    them to Blackboard as the latest version of your assignment work. We will grade the latest
    version on Blackboard, but it should exactly match your code on GitHub.
