Discussion N

Prompt

Read the article below and discuss how breaking the traditional RAID concepts helps Big Data deal with ever-growing needs of a storage system.

Google File System Eval: Part I

GFS: Fundamental Paradigm Shift or One-off Oddity?

As regular readers know, I believe that the current model of enterprise storage is badly broken.

When I see something that blows that model away I like to learn more. In particular, I assess the

marketability of the innovation. So a cool technology, like GFS or backup compression, makes

me wonder if and how customers would buy it.

This article offers a 100,000 foot view of GFS by way of assessing its commercial viability. If you

want a technical article about GFS, I can only recommend The Google File System by

Ghemawat, Gobioff, & Leung, from which this article draws heavily. The Wikipedia article,

oddly, isn’t very good on GFS (as of 5-16-06). If you don’t have a BS/CS or better you’ll likely

find Ghemawat et al. a slog. Probably worthwhile. Good for temporary relief of insomnia.

Google From Space

GFS is one of the key technologies that enables the most powerful general purpose cluster in

history. While most IT folks lose sleep over keeping the Exchange server backed up for a couple

of thousand users, Google’s infrastructure both supports massive user populations and the

regular roll out of compute and data intensive applications that would leave most IT ops folks

gibbering in fear. How do they do it?

Partly they are smarter than you. Google employs hundreds of CompSci PhDs as well as many

more hundreds of really smart people. Partly it is their history: impoverished PhD candidates

can’t afford fancy hardware to build their toys, so they started cheap and got cheaper. And

finally, being really smart and really poor, they rethought the whole IT infrastructure paradigm.

Their big insight: rather than build availability into every IT element at great cost, build

availability around every element at low cost. Which totally changes the economics of IT, just

as the minicomputer in the ’70s, the PC in the ’80s and the LAN in the ’90s all did. Only more

so. When processors, bandwidth and storage are cheap you can afford to spend lots of cycles on

what IBM calls autonomic computing. With the system properly architected and cheap to build

out, it scales both operationally and economically.

All that said, Google hasn’t done anything unique with their platform that other people hadn’t

already. They just put it together and scaled it to unprecedented heights.

Note to CIOs: it isn’t going to take your users long to notice that Google can do this stuff and

you can’t. Your life isn’t going to get any easier.

GFS From Low-Earth Orbit

Despite the name, GFS (not to be confused (although how exactly I don’t know) with Sistina’s

GFS – maybe we should call it GooFS) is not just a file system. It also maintains data

redundancy, supports low-cost snapshots, and, in addition to normal create, delete, open, close,

read, write operations also offers a record append operation.

That record append operation reflects part of the unique nature of the Google workload: fed by

hundreds of web-crawling bots, Google’s data is constantly updated with large sequential writes.

Rather than synchronize and coordinate the overwriting of existing data it is much cheaper to

simply append new data to existing data.

Another feature of the Google workload is that it mostly consists of two kinds of reads: large

streaming reads and small random reads. As large reads and writes are so common, GFS is

optimized for sustained bandwidth rather than low latency or IOPS. As multi-gigabyte files are

the common case, GFS is optimized for handling a few million files, so, doing the math, a single

GFS should be able to handle a few petabytes of active data.
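
The "doing the math" step can be made explicit. The following is only a back-of-envelope sketch: the file count and average file size are illustrative assumptions picked to match "a few million" and "multi-gigabyte", not figures from the paper.

```python
# Back-of-envelope capacity estimate. All inputs are illustrative assumptions.
GB = 10**9
PB = 10**15


def active_data_bytes(num_files: int, avg_file_size_bytes: int) -> int:
    """Total logical data held in num_files files of the given average size."""
    return num_files * avg_file_size_bytes


# Assumed: 2 million files averaging 1 GB each.
total = active_data_bytes(2_000_000, 1 * GB)
print(total / PB)  # 2.0 (petabytes of logical data, before replication)
```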

All this is built on very cheap components whose frequent failure, given the size of cluster, is

expected. The system monitors itself and detects, tolerates, and recovers quickly from

component failures, including disk, network and server failures.

GFS From An SR-71 Blackbird

A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple

clients. Each of these is typically a dirt-cheap Linux box (lately dual 2 GHz Xeons with 2 GB

RAM and ~800GB of disk).

Files are divided into chunks, each identified by a unique 64-bit handle, and are stored on the

local systems as Linux files. Each chunk is replicated at least once on another server, and the

default is three copies of every chunk (take that, RAID-6 fanboys!). The chunks are big, like the

files they make up: 64MB is the standard chunk size. The chunkservers don’t cache file data

since the chunks are stored locally and the Linux buffer cache keeps frequently accessed data in

memory.
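
With a fixed chunk size, finding which chunk holds a given byte is plain integer arithmetic. Here is a minimal sketch of that mapping; the function names are hypothetical and chosen for illustration, and only the 64 MB chunk size comes from the text above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the standard chunk size cited above


def chunk_index(byte_offset: int) -> int:
    """Index of the chunk that contains the given byte offset."""
    return byte_offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> range:
    """Indices of every chunk touched by reading `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return range(first, last + 1)


def chunk_count(file_size: int) -> int:
    """Number of chunks needed to hold a file (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


# A 2-byte read that straddles the first chunk boundary touches chunks 0 and 1.
print(list(chunks_for_range(CHUNK_SIZE - 1, 2)))  # [0, 1]
```

A multi-gigabyte file needs only a few dozen chunk lookups, which is part of why the chunk size is so large.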

If, like me, you thought bottleneck/SPOF when you saw the single master, you would, like me,

have been several steps behind the architects. The master only tells clients (in tiny multibyte

messages) which chunkservers have needed chunks. Clients then interact directly with

chunkservers for most subsequent operations. Now grok one of the big advantages of a large

chunk size: clients don’t need much interaction with masters to gain access to a lot of data.

That covers the bottleneck problem, but what about the SPOF (single point of failure) problem?

We know the data is usually copied three times — when disk is really cheap you can afford that

— but what about the all-important metadata that keeps track of where all the chunks are?

The master stores — in memory for speed — three major types of metadata:

• File and chunk names [or namespaces in geekspeak]

• Mapping from files to chunks, i.e. the chunks that make up each file

• Locations of each chunk’s replicas
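
The three kinds of metadata listed above can be pictured as two in-memory maps. This is only a sketch: the class, field and server names are hypothetical, and the real master's data structures are not described in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class MasterMetadata:
    # Kinds 1 and 2: namespace plus file-to-chunk mapping
    # (file path -> ordered list of chunk handles).
    file_chunks: Dict[str, List[int]] = field(default_factory=dict)
    # Kind 3: replica locations (chunk handle -> chunkserver names).
    replica_locations: Dict[int, Set[str]] = field(default_factory=dict)

    def locate(self, path: str, chunk_idx: int) -> Tuple[int, Set[str]]:
        """Answer a client's question: which servers hold this chunk?"""
        handle = self.file_chunks[path][chunk_idx]
        return handle, self.replica_locations.get(handle, set())


meta = MasterMetadata()
meta.file_chunks["/crawl/part-0001"] = [0xA1, 0xA2]
meta.replica_locations[0xA2] = {"cs-17", "cs-42", "cs-88"}
handle, servers = meta.locate("/crawl/part-0001", 1)
```

The reply is just a handle and a few server names, which is why the master's messages stay tiny while the bulk data flows directly between clients and chunkservers.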

So if the master crashes, this data has to be replaced pronto. The first two — namespaces and

mapping — are kept persistent by a log stored on the master’s local disk and replicated on

remote machines. This log is checkpointed frequently for fast recovery if a new master is needed.

How fast? Reads start up almost instantly thanks to shadow masters who stay current with the

master in the background. Writes pause for about 30-60 seconds while the new master and the

chunkservers make nice. Many RAID arrays recover no faster.
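
Recovery from a checkpoint plus a log amounts to loading the last checkpoint and replaying whatever was logged after it. A minimal sketch, assuming a made-up record format of `(operation, path, chunk_handle)` tuples; the actual log format is not given in this article.

```python
def recover(checkpoint: dict, log_tail: list) -> dict:
    """Rebuild file -> chunk handles from a checkpoint plus later log records."""
    # Start from a copy of the checkpointed state.
    state = {path: list(chunks) for path, chunks in checkpoint.items()}
    # Replay, in order, every record written after the checkpoint.
    for op, path, handle in log_tail:
        if op == "create":
            state[path] = []
        elif op == "add_chunk":
            state[path].append(handle)
        elif op == "delete":
            state.pop(path, None)
    return state


checkpoint = {"/a": [1]}
log_tail = [
    ("add_chunk", "/a", 2),
    ("create", "/b", None),
    ("add_chunk", "/b", 3),
    ("delete", "/a", None),
]
print(recover(checkpoint, log_tail))  # {'/b': [3]}
```

Frequent checkpoints keep `log_tail` short, so a new master has little to replay.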

The last type of metadata, replica locations, is stored on each chunkserver — and copied on

nearby machines — and given to the master at startup or when a chunkserver enters a cluster.

Since the master controls the chunk placement it is able to keep itself up-to-date as new chunks

get written.

The master also keeps track of the health of the cluster through handshaking with all the

chunkservers. Data corruption is detected through checksumming. Even so, data may still get

pooched. Thus the GFS reliance on appending writes instead of overwrites; combined with

frequent checkpoints, snapshots and replicas, the chance of data loss is very low, and results in

data unavailability, not data corruption.
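
Checksumming can be sketched in a few lines. The article only says corruption is "detected through checksumming", so the CRC-32 algorithm and the 64 KB block size here are assumptions made for illustration.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed checksum granularity for this sketch


def block_checksums(data: bytes) -> list:
    """CRC-32 of each fixed-size block of the data."""
    return [
        zlib.crc32(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def corrupt_blocks(data: bytes, expected: list) -> list:
    """Indices of blocks whose current checksum differs from the stored one."""
    actual = block_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(200_000)            # four blocks of zeros (the last is short)
stored = block_checksums(original)   # checksums recorded at write time
damaged = bytearray(original)
damaged[70_000] ^= 0xFF              # flip one byte inside block 1
print(corrupt_blocks(bytes(damaged), stored))  # [1]
```

A chunkserver that finds a bad block can report it and let a healthy replica supply the data.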

GFS RAID

They don’t call it that, but StorageMojo.com cares about storage, and I find this part particularly

interesting. GFS doesn’t use any RAID controllers, fibre channel, iSCSI, HBAs, FC or SCSI

disks, dual-porting or any of the other costly bling we expect in a wide-awake data center. And

yet it all works and works very well.

Take replica creation or what you and I would call mirroring. All the servers in the cluster are

connected over a full duplex switched Ethernet fabric with pipelined data transfers. This means

that as soon as a new chunk starts arriving, the chunkserver can begin making replicas at full

network bandwidth (about 12MB/sec) without reducing the incoming data rate. As soon as the

first replica chunkserver has received some data it repeats the process, so the two replicas are

completed soon after the first chunk write finishes.
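
A rough timing model shows why pipelining matters. In this simplified sketch, a pipelined chain costs one transfer time plus a small per-hop delay, while a store-and-forward chain costs one full transfer per replica. The 1 ms hop latency is an assumption; the 64 MB chunk and roughly 12 MB/sec link rate come from the article.

```python
def pipelined_seconds(size_mb, rate_mb_s, replicas, hop_latency_s):
    """Each server forwards data as it arrives: one transfer plus per-hop delay."""
    return size_mb / rate_mb_s + replicas * hop_latency_s


def store_and_forward_seconds(size_mb, rate_mb_s, replicas):
    """Each server waits for the whole chunk before sending it onward."""
    return replicas * size_mb / rate_mb_s


fast = pipelined_seconds(64, 12, replicas=3, hop_latency_s=0.001)
slow = store_and_forward_seconds(64, 12, replicas=3)
print(round(fast, 2), slow)  # 5.34 16.0
```

Under these assumptions three copies cost barely more time than one.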

In addition to creating replicas quickly, the master’s replica placement rules also spread them

across machines and across racks, to limit the chance of data unavailability due to power or

network switch loss.

Pooling and Balancing

Storage virtualization may be on the downside of the hype cycle, and looking at GFS you can see

what simple virtualization looks like when built into the file system. Instead of a complex

software layer to “pool” all the blocks across RAID arrays, GFS masters place new replicas on

chunkservers with below average disk utilization. So over time disk utilization equalizes across

servers without any tricky and expensive software.

The master also rebalances replicas periodically, looking at disk space utilization and load

balancing. This process also keeps a new chunkserver from being swamped the moment it joins

the cluster. The master allocates data to it gradually. The master also moves chunks from

chunkservers with above average disk utilization to equalize usage.
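
The placement rules described here, prefer emptier disks and spread copies across racks, can be sketched as a simple selection function. The function name, tuple layout and server names are hypothetical; the sketch treats "below average" as "at or below average" so that a perfectly balanced cluster still yields candidates.

```python
def pick_replica_targets(servers, copies=3):
    """servers: list of (name, rack, disk utilization in 0..1).

    Returns up to `copies` server names, emptiest first, one per rack.
    """
    average = sum(u for _, _, u in servers) / len(servers)
    # Only consider servers at or below average utilization, emptiest first.
    candidates = sorted((s for s in servers if s[2] <= average), key=lambda s: s[2])
    chosen, used_racks = [], set()
    for name, rack, _ in candidates:
        if rack not in used_racks:  # never put two copies in the same rack
            chosen.append(name)
            used_racks.add(rack)
        if len(chosen) == copies:
            break
    return chosen


servers = [
    ("cs1", "r1", 0.90), ("cs2", "r1", 0.20),
    ("cs3", "r2", 0.30), ("cs4", "r2", 0.25),
    ("cs5", "r3", 0.50), ("cs6", "r3", 0.95),
]
print(pick_replica_targets(servers))  # ['cs2', 'cs4', 'cs5']
```

If too few racks have an eligible server the function returns fewer names than requested; a real system would need a fallback policy, which this sketch leaves out.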

Storage capacity is reclaimed slowly. Rather than eager deletion, the master lets old chunks hang

around for a few days and reclaims storage in batches. Done in the background when the master

isn’t too busy, it minimizes the impact on the cluster. In addition, since the chunks are renamed

rather than deleted, the system provides another line of defense against accidental data loss.
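
Lazy reclamation can be sketched as "rename now, sweep later". The class name, the hidden-name format and the three-day grace period are assumptions for illustration; the text above says only "a few days".

```python
GRACE_SECONDS = 3 * 24 * 3600  # assumed grace period of three days


class Namespace:
    def __init__(self):
        self.files = {}   # visible path -> chunk handles
        self.hidden = {}  # hidden name -> (deletion time, chunk handles)

    def delete(self, path, now):
        # Deletion is just a rename to a hidden, timestamped name.
        self.hidden[f".deleted/{path}@{now}"] = (now, self.files.pop(path))

    def sweep(self, now):
        """Reclaim hidden files past the grace period; return freed handles."""
        freed = []
        for name, (deleted_at, chunks) in list(self.hidden.items()):
            if now - deleted_at >= GRACE_SECONDS:
                freed.extend(chunks)
                del self.hidden[name]
        return freed


ns = Namespace()
ns.files["/a"] = [1, 2]
ns.delete("/a", now=0)
early = ns.sweep(now=3600)          # too soon, nothing is reclaimed
late = ns.sweep(now=GRACE_SECONDS)  # grace period over, chunks are freed
```

Until the sweep runs, an accidental delete can be undone by renaming the hidden file back.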

Cap’n, The Dilithium Crystals Canna Take ‘N’More! Oh, Shut Up, Scotty.

Google ran a couple of tests on GFS clusters.

We all know this must work in the real world since we all use Google every day. But how well

does it work? In the paper they present some statistics from a couple of Google GFS clusters.

Google File System Eval: Part II

In yesterday’s post I ran through a quick (really, it was!) overview of the Google File System’s

organization and storage-related features such as RAID and high-availability. I want to offer a

little more data about the performance of GFS before offering my conclusion about the

marketability of GFS as a commercial product.

The Google File System by Ghemawat,

Gobioff, & Leung, includes some interesting performance info. These examples can’t be

regarded as representative since we don’t know enough about the population of GFS clusters at

Google, so any conclusions drawn from them are necessarily tentative.

They looked at two GFS clusters configured like this:

Cluster                      A         B
Chunkservers                 342       227
Available Disk Cap.          72 TB     180 TB
Used Disk Cap.               55 TB     155 TB
Number of Files              735 k     737 k
Number of Dead Files         22 k      232 k
Number of Chunks             992 k     1550 k
Metadata at Chunkservers     13 GB     21 GB
Metadata at Master           48 MB     60 MB

So we have a couple of fair sized storage systems, one utilizing about 80% of available space,

while the other is close to 90%. Respectable numbers for any data center storage manager. We

also see that chunk metadata appears to scale linearly with the number of chunks. Good. The

average file size on A appears to be roughly 1/3 that of B. The average file sizes appear to be

about 75 MB for A and 210 MB for B. Much larger than the average data center file size.
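
The utilization and file-size figures can be reproduced from the cluster table. This sketch follows the article's method of dividing used capacity by the number of files; if used capacity counts every replica, the logical average file size would be correspondingly smaller. The dictionary layout and function names are illustrative.

```python
TB = 10**12
MB = 10**6

clusters = {
    "A": {"available_tb": 72, "used_tb": 55, "files_k": 735},
    "B": {"available_tb": 180, "used_tb": 155, "files_k": 737},
}


def utilization(c):
    """Fraction of available disk capacity in use."""
    return c["used_tb"] / c["available_tb"]


def avg_file_mb(c):
    """Used capacity divided by file count, in MB."""
    return c["used_tb"] * TB / (c["files_k"] * 1000) / MB


for name, c in clusters.items():
    print(name, f"{utilization(c):.0%}", f"{avg_file_mb(c):.0f} MB")
# A 76% 75 MB
# B 86% 210 MB
```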

Next we get some performance data for the two clusters:

Cluster                        A            B
Read Rate – last minute        583 MB/s     380 MB/s
Read Rate – last hour          562 MB/s     384 MB/s
Read Rate – since restart      589 MB/s     49 MB/s
Write Rate – last minute       1 MB/s       101 MB/s
Write Rate – last hour         2 MB/s       117 MB/s
Write Rate – since restart     25 MB/s      13 MB/s
Master Ops – last minute       325 Op/s     533 Op/s
Master Ops – last hour         381 Op/s     518 Op/s
Master Ops – since restart     202 Op/s     347 Op/s

Source: http://labs.google.com/papers/gfs.html

Just as the gentlemen said, there is excellent sequential read performance, very good sequential

write performance, and unimpressive small write performance. Looking at cluster A’s

performance, I infer that in the last minute it performed about 125 small writes per second, averaging about

8k each. Clearly, not ready for the heads-down, 500 desk, Oracle call center. Not the design

center either. It appears to me though, that this performance would compete handily with an

EMC Centera or even the new NetApp FAS6000 series on a large file workload. Not bad for a 3

year old system constructed from commodity parts.

Conclusion

The GFS implementation we’ve looked at here offers many winning attributes.

These include:

• Availability. Triple redundancy (or more if users choose), pipelined chunk replication,

rapid master failovers, intelligent replica placement, automatic re-replication, and cheap

snapshot copies. All of these features deliver what Google users see every day:

datacenter-class availability in one of the world’s largest datacenters.

• Performance. Most workloads, even databases, are about 90% reads. GFS performance

on large sequential reads is exemplary. It was child’s play for Google to add video

download to their product set, and I suspect their cost-per-byte is better than YouTube or

any of the other video sharing services.

• Management. The system offers much of what IBM calls “autonomic” management. It

manages itself through multiple failure modes, offers automatic load balancing and

storage pooling, and provides features, such as the snapshots and 3 day window for dead

chunks to remain on the system, that give management an extra line of defense against

failure and mistakes. I’d love to know how many sysadmins it takes to run a system like

this.

• Cost. Storage doesn’t get any cheaper than ATA drives in a system box.

Yet as a general purpose commercial product, it suffers some serious shortcomings.

• Performance on small reads and writes, which it wasn’t designed for, isn’t good enough

for general data center workloads.

• The record append file operation and the “relaxed” consistency model, while excellent

for Google, wouldn’t fit many enterprise workloads. It might be that email systems,

where SOX requirements are pushing retention, might be redesigned to eliminate deletes.

Since appending is key to GFS write performance in a multi-writer environment, it might

be that GFS would give up much of its performance advantage even in large serial writes

in the enterprise.

• Lest we forget, GFS is NFS, not for sale. Google must see its infrastructure technology as

a critical competitive advantage, so it is highly unlikely to open source GFS any time

soon.

Looking at the whole gestalt, even assuming GFS were for sale, it is a niche product and would

not be very successful on the open market.

As a model for what can be done however, it is invaluable. The industry has strived for the last

20 years to add availability and scalability to an increasingly untenable storage model of blocks

and volumes, through building ever-costlier “bulletproof” devices.

GFS breaks that model and shows us what can be done when the entire storage paradigm is

rethought. Build the availability around the devices, not in them, treat the storage infrastructure

as a single system, not a collection of parts, extend the file system paradigm to include much of

what we now consider storage management, including virtualization, continuous data protection,

load balancing and capacity management.

GFS is not the future. But it shows us what the future can be.
