Analysis of Convolutional Neural Network-Based Algorithm for Annotation Cost Estimation in the Field of Supervised Machine Learning


Abstract

Active learning is an important machine learning process of selectively querying users to label or annotate examples, with the goal of reducing the overall annotation cost. Most existing convolutional neural network (CNN) work rests on the simple assumption that the annotation cost of every labeling query is the same or fixed, but this assumption may not be realistic: in practice, the annotation cost can vary from instance to instance. In this work, I study and present several annotation cost-sensitive active learning algorithms, which must estimate the utility and the cost of each query simultaneously. The goal is to build or merge different machine learning models and reduce the total labeling cost needed to train a model. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) into an efficient architecture that works across different datasets, thereby reducing the overall labeling/annotation cost in the field of supervised machine learning, and I validate that the proposed method is generally superior to other annotation cost-sensitive algorithms.

Keywords:  active learning, annotation, CNN, labeling, supervised learning

Table of Contents

Introduction

Architecture and Proposed Procedure

Task Analysis

Task 1: Data Understanding

Task 2: Data Preparation

Task 3: Modelling

Task 4: Evaluation of Results

Project Roadmap and Timeline

Credentials

Conclusion

Visual Graphics

References & Citations

 

Introduction

Traditional machine learning algorithms use whatever labeled data they are given to induce a model. By contrast, an active learning algorithm can select which instances are labeled and added to the training set. A learner typically starts with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data, and queries an oracle (e.g., a human annotator) for their labels. The objective is to reduce the overall annotation cost of training a model. The notion of annotation cost must be better understood and incorporated into the active learning process in order to genuinely reduce the labeling cost required to build an accurate model. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) into an efficient architecture that works across different datasets, thus reducing the overall labeling/annotation cost in the field of supervised machine learning.
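To make the loop concrete, here is a minimal sketch of pool-based active learning with uncertainty-based querying; the classifier choice is illustrative, and `oracle_label` is a hypothetical stand-in for the human annotator:

```python
# Minimal pool-based active learning loop. `oracle_label` is a hypothetical
# callback standing in for the human annotator; the classifier is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label, budget=20):
    clf = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        clf.fit(X_labeled, y_labeled)
        # Query the pool instance the classifier is least confident about.
        probs = clf.predict_proba(X_pool)
        i = int(np.argmax(1.0 - probs.max(axis=1)))
        # Ask the oracle, then move the instance from the pool to the training set.
        X_labeled = np.vstack([X_labeled, X_pool[i]])
        y_labeled = np.append(y_labeled, oracle_label(X_pool[i]))
        X_pool = np.delete(X_pool, i, axis=0)
    return clf.fit(X_labeled, y_labeled)
```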

Active learning is a machine learning setup that enables machines to strategically “ask questions” of an oracle (Settles, 2010) in order to reduce the cost of labeling. Annotation cost has traditionally been measured by the number of examples, but it has been widely recognized that different examples may require different annotation efforts (Settles et al., 2008).

Vast quantities of unlabeled instances can be acquired easily in many machine learning scenarios, yet high-quality labels are expensive to obtain. For example, a massive number of experiments and analyses are needed in fields such as medicine (Liu, 2004) or biology (King et al., 2004) to label a single instance, while collecting samples is a relatively easy task. There are several variations in the setup of cost-sensitive active learning. In (Margineantu, 2005), the labeling cost of every data instance is assumed to be known before querying, while in (Settles et al., 2008) the cost of a data instance is only revealed after its label is queried. In this work, I concentrate on the latter setup, which closely matches the real-world human annotation scenario. Existing works (Haertel et al., 2008) must therefore estimate the utility and the cost of each instance simultaneously and select instances with high utility and low cost, as in the sketch below.
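The selection rule under this setup can be sketched as ranking pool instances by predicted utility per unit of predicted cost; `estimate_utility` and `estimate_cost` below are hypothetical placeholder models, not part of any cited method:

```python
# Sketch: cost-sensitive query selection, ranking by utility per unit cost.
# `estimate_utility` and `estimate_cost` are hypothetical models that would be
# learned alongside the classifier (e.g., a regressor over annotation times).
import numpy as np

def select_query(X_pool, estimate_utility, estimate_cost):
    utility = np.array([estimate_utility(x) for x in X_pool])
    cost = np.array([estimate_cost(x) for x in X_pool])  # predicted, not true cost
    return int(np.argmax(utility / np.maximum(cost, 1e-9)))
```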


The idea of uncertainty sampling (Lewis and Gale, 1994) is to query the label of the data instance about which the classifier is most uncertain. For example, with a support vector machine (SVM), (Tong and Koller, 2001) propose querying the data instance closest to the decision boundary; (Holub et al., 2008) select the data instances to query based on the entropy of the label probabilities of a probabilistic classifier.
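A minimal sketch of entropy-based selection in the spirit of Holub et al. (2008), assuming any scikit-learn-style probabilistic classifier:

```python
# Sketch: entropy-based uncertainty sampling. `clf` is any classifier exposing
# predict_proba (scikit-learn convention); the small constant avoids log(0).
import numpy as np

def most_uncertain(clf, X_pool, n_queries=5):
    """Return pool indices with the highest predictive-label entropy."""
    probs = clf.predict_proba(X_pool)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:n_queries]
```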

(Kang et al., 2004) first query the data instances closest to each cluster’s centroid before applying any other selection criteria; (Huang et al., 2010) measure the representativeness of each data instance from both the cluster structure of the unlabeled data and the class assignments of the labeled data; and (Xu et al., 2003) cluster the data instances close to the SVM decision boundary and query the labels of instances close to the center of each cluster. In (Nguyen and Smeulders, 2004), clustering is used to estimate the labeling probabilities of unlabeled data instances, which is the key component in measuring data instance utilities.

Various works target annotation cost-sensitive active learning under different problem settings, such as the querying target (Greiner et al., 2002), the number of labelers (Donmez and Carbonell, 2008), the target classification problem (Yan and Huang, 2018), and the applied data domain (Vijayanarasimhan and Grauman, 2011).

To discuss cost-sensitive active learning with unknown costs, the first question to answer is whether the cost of human annotation can be estimated accurately. In (Arora et al., 2009), various unsupervised models are proposed to estimate the annotation cost of corpus datasets, while (Settles et al., 2008) show that annotation cost can be estimated accurately with a supervised learning model.

Active learning is a widespread framework with the ability to automatically select the most informative unlabeled examples for annotation. The motivation behind uncertainty sampling is to find the unlabeled examples closest to the labeled data set (nearest neighbors) and use them to assign labels. To achieve this, I build a CNN-based document classifier for any input article with an unknown target label and use cosine similarity to find the most similar documents as neighbors of each unlabeled document in the training set. It is then fair to assume that the closest similar documents can be given the same label, which lets the oracle label with a smaller set of inputs.
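A minimal sketch of this neighbor lookup, assuming the document vectors have already been produced by the CNN encoder (the encoder itself is out of scope here):

```python
# Sketch: suggest labels for unlabeled documents from their nearest labeled
# neighbors under cosine similarity. The document vectors are assumed to come
# from the CNN encoder, which is out of scope for this snippet.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def suggest_labels(unlabeled_vecs, labeled_vecs, labels, k=1):
    sims = cosine_similarity(unlabeled_vecs, labeled_vecs)  # (n_unlab, n_lab)
    nearest = np.argsort(-sims, axis=1)[:, :k]              # top-k neighbors per row
    return [[labels[j] for j in row] for row in nearest]
```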

Architecture and Proposed Procedure

The architecture combines two major components. The first collects and preprocesses the documents, defines the similarity measures, and develops the related models. The second part takes unlabeled data and uses the different models to perform similarity checks. The output of the system uses the effective models to identify neighboring documents/articles. In this work I evaluate multiple models for improving document similarity in order to reduce the overall labeling effort. For the similarity score I use Word2Vec: two word2vec-based similarity measures built on the Vector Space Model (“Centroids” and “Word Mover’s Distance (WMD)”) will be studied and compared with the commonly used Latent Semantic Indexing (LSI). The 20 Newsgroups dataset will also be used to compare the document similarity measures.
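As a sketch of how the two word2vec-based measures are computed in practice, assuming gensim word vectors (the vector file path is illustrative, and WMD additionally needs the POT/pyemd solver installed):

```python
# Sketch: "Centroid" similarity vs. Word Mover's Distance over word2vec vectors.
# The vector file path is hypothetical; wmdistance needs the POT/pyemd solver.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # illustrative path

def centroid(tokens):
    """Average the word vectors of the in-vocabulary tokens."""
    return np.mean([wv[t] for t in tokens if t in wv], axis=0)

def centroid_similarity(doc_a, doc_b):
    a, b = centroid(doc_a), centroid(doc_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = "gene expression data for cancer classification".split()
doc_b = "the patient shows abnormal gene expression".split()
print("centroid cosine:", centroid_similarity(doc_a, doc_b))  # higher = more similar
print("WMD:", wv.wmdistance(doc_a, doc_b))                    # lower = more similar
```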

The following figure gives an overview of the methodology:

Figure 1: Overall Architecture

Task Analysis

To implement this design modularly, I have divided the project into four independent tasks:

Data Understanding

Data Preparation – Prepare the data for the machine learning algorithm

Modelling – Select and train the models

Evaluation of Results

Task 1: Data Understanding

To conduct the testing, I have to assess the data situation, obtain access to the data, and explore it once it is available. I used the ETL data-pipeline tool Informatica PowerCenter to build the data warehouse. It was deployed in a virtual machine with the following specifications:

Operating System: Windows Server 2012 R2 Standard

RAM: 32 GB

CPU: 8 cores, 2.40 GHz

Kernel Version: 9.3.9600.18821

Task 2: Data Preparation

Data preparation is the process of gathering, cleaning, and consolidating data into a single file or data table, primarily for analysis purposes. I used Datawatch Monarch, the industry’s leading solution for self-service data preparation. The recommended specifications for Datawatch Monarch are as follows:

Windows 10 – 8 GB memory

5 GB disk space

2 GHz or faster processor

Google Chrome

.NET Framework 4.5.2

Microsoft Access Database Engine 2010 version

Microsoft SQL Server

Task 3: Modelling

I went with scikit-learn, the machine learning library for the Python programming language, to implement some models quickly during this project. To get the data ready for machine learning, some basic steps are needed: missing value imputation, encoding of categorical variables, and optionally feature selection if the input dimension is too large (see the sketch after the dependency list below). The scikit-learn library requires the following dependencies:

Python (>= 2.7 or >= 3.4)

NumPy (>= 1.8.2)

SciPy (>= 0.13.3)
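Here is the minimal preprocessing sketch referred to above, built from scikit-learn’s standard pieces; the column names are illustrative, not taken from the project’s actual datasets:

```python
# Minimal sketch of the preparation steps named above; column names are
# illustrative placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif

numeric = ["age", "score"]      # hypothetical numeric columns
categorical = ["category"]      # hypothetical categorical column

prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipeline = Pipeline([
    ("prep", prep),
    ("select", SelectKBest(f_classif, k=10)),  # optional feature selection
])
# Usage: pipeline.fit_transform(X_train, y_train) yields the model-ready matrix.
```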

Task 4: Evaluation of Results

As part of the testing, I compared the three methods (LSI, Centroid, and WMD). First, a local analysis on a single example was done to get a sense of how well the methods work; a lemmatization step was applied and duplicates were removed to keep the results readable. Then a global analysis was done with a clustering task. The quality of the clustering task for each method is given by the Normalized Mutual Information (NMI) values in Table 1.

Data Set    LSI     Centroid    WMD
ng20        0.40    0.31        NA
snippets    0.25    0.46        0.39

Table 1: Normalized Mutual Information
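Each NMI value in Table 1 can be computed along these lines, assuming document vectors from one of the methods and the gold category labels; the clustering algorithm shown is an illustrative choice:

```python
# Sketch: computing one NMI cell of Table 1 from document vectors (LSI or
# Centroid); for WMD one would cluster the pairwise distance matrix instead,
# e.g. with k-medoids. The k-means choice here is illustrative.
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def nmi_for(doc_vectors, gold_labels, n_clusters):
    """Cluster the document vectors and score them against the gold labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_vectors)
    return normalized_mutual_info_score(gold_labels, pred)
```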

Finally, I compared the overall performance of the considered methods under common clustering algorithms such as K-medoids, K-means, complete linkage, Ward, and DBSCAN.

Figure 2: Top Score Overall Comparison

Project Roadmap and Timeline

Arriving at the distributed architecture explained in the sections above requires six steps, as laid out in the timeline below:

The first step involves reading and analyzing various relevant research papers and documents. This initial part will take around two weeks.

For the next three steps I have selected several existing algorithms, and I will test each one and record its results. Testing and recording the results of the LSI algorithm will take a week.

In this step I will test and record the results of the Centroid algorithm. This part will take a week.

In this step I will test and record the results of the WMD algorithm. This part will also take a week.

In this step the results of the algorithms from the steps above are compared across various metrics and the bottlenecks are identified. This step will take one week.

The final step involves combining the LSI and WMD algorithms and applying various optimization steps to address the identified issues, so that the final algorithm reduces the total labeling/annotation cost. This step will take two weeks.

The following chart details the steps and timeline of the proposed study:

Figure 3: Gantt Chart for the steps and timeline

Credentials

I am a lead member of technical staff at Salesforce with over 15 years of experience in the software industry. I am responsible for the design, development, and execution of test frameworks/harnesses for unit testing of Java-based cloud services. I have also developed Java-based tools for load and performance testing of applications within large-scale Linux clusters. I have professional-level experience in the following technologies: Java, Python, big data technologies, functional testing, automation, and performance engineering. I lead the Sales Cloud prediction quality team at Salesforce. Prior to Salesforce, I worked at Intuit Inc. as a Staff Engineer.

Conclusion

The existing literature provides well-defined explanations and comparisons of various algorithms for calculating annotation/labeling costs in supervised machine learning, but it does not include improvements such as combining models into an architecture that works uniformly across different datasets while accounting for the behavior and volume of the data. I demonstrated that for the long texts of the 20 Newsgroups dataset, LSI is the best method, while WMD and the Centroid method both yield better clustering than LSI on the Web snippets dataset. The main focus of future work will be to investigate cost-sensitive active learning strategies that are more robust when given approximate, predicted annotation costs.

Visual Graphics

Figure 4: Steps involved for Analysis

Figure 5: Histogram for Annotation Time

References & Citations

Settles B (2010). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison

Settles B, Craven M, Friedland L (2008). Active learning with real annotation costs. In: Proceedings of the NIPS workshop on cost-sensitive learning, pp 1–10

Liu Y (2004). Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences

King RD, Whelan KE, Jones FM, Reiser PG, Bryant CH, Muggleton SH, Kell DB, Oliver SG (2004). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427(6971):247–252

Margineantu DD (2005). Active cost-sensitive learning. In: Proceedings of International Joint Conference on Artificial Intelligence, pp 1622–1623

Lewis DD, Gale WA (1994). A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Springer-Verlag New York, Inc., pp 3–12

Tong S, Koller D (2001). Support vector machine active learning with applications to text classification. Journal of machine learning research 2(Nov):45–66

Holub A, Perona P, Burl MC (2008). Entropy-based active learning for object recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, pp 1–8

Kang J, Ryu KR, Kwon HC (2004). Using cluster-based sampling to select initial training set for active learning in text classification. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 384–388

Huang SJ, Jin R, Zhou ZH (2010). Active learning by querying informative and representative examples. In: Advances in neural information processing systems, pp 892–900

Xu Z, Yu K, Tresp V, Xu X, Wang J (2003). Representative sampling for text classification using support vector machines. In: European Conference on Information Retrieval, Springer, pp 393–407

Nguyen HT, Smeulders A (2004). Active learning using pre-clustering. In: Proceedings of the 21st International Conference on Machine Learning, ACM, p 79

Donmez P, Carbonell JG (2008). Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In: Proceedings of the 17th ACM conference on Information and knowledge management, ACM, pp 619–628

Guillory A, Bilmes J (2009). Average-case active learning with costs. In: International Conference on Algorithmic Learning Theory, Springer, pp 141–155

Cuong N, Xu H (2016). Adaptive maximization of pointwise submodular functions with budget constraint. In: Advances in Neural Information Processing Systems, pp 1244–1252

Vijayanarasimhan S, Grauman K (2011). Cost-sensitive active visual category learning. International Journal of Computer Vision 91(1):24–44

Arora S, Nyberg E, Rosé CP (2009). Estimating annotation cost for active learning in a multi-annotator environment. In: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, Association for Computational Linguistics, pp 18–26

 
