Online Learning Algorithms in Varied Bandit Scenarios

1.      ABSTRACT


In this paper, we discuss the performance of non-stochastic (adversarial-setting) algorithms in two special scenarios:

The case where agents/players are given only partial (sample) feedback.

The case where the agent's reward information is falsified by the adversary.

2.     INTRODUCTION

A setting where the agent can observe the payouts of all N arms after a round is called full information. Practical problems rarely afford this luxury: the agent receives only limited feedback based on its own decision, i.e., it can observe the payouts of only the arms it pulled in that round. This scenario is called semi-bandit feedback.
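To make the distinction concrete, here is a minimal sketch of the three feedback regimes for a combinatorial action over N arms (all names are ours, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
N = 5
payouts = rng.random(N)          # this round's payout for every arm
action = [1, 3]                  # the agent pulls arms 1 and 3

full_info   = payouts                           # full information: every arm's payout
semi_bandit = {i: payouts[i] for i in action}   # semi-bandit: payouts of pulled arms only
bandit      = payouts[action].sum()             # bandit: only the total reward received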

Bandit theory was long constrained from application to many practical problems by the lack of a method for handling scenarios where the agent does not have full information. When bandit algorithms are combined with online optimization methods, however, accurate importance-weighted estimates can be obtained. Online optimization differs from other optimization frameworks such as stochastic optimization, robust optimization, and Markov decision processes; shortest-path problems are a classic example of problems solved with online optimization.

Suppose we have the following adversarial multi-armed bandit (MAB) setting at hand:

Consider a simple 2-player capture-the-flag game, which consists of 5 sites and 2 flags. At each round, Player 1 (the hider) chooses 2 sites to hide her flags. Note that each site can contain at most 1 flag. Player 2 (the seeker), without knowing the exact places of the flags, must guess where the flags could be. To do so, she can choose 2 sites to check whether the flags are hidden there. If the seeker finds both flags, she gets 2 points; if she finds only 1 flag, her reward is 1; and 0 otherwise. The goal of the seeker is thus to maximize her total points over time, while the hider aims to minimize the total reward of her opponent. This problem is equivalent to one with 5 non-identical slot machines, which lets us apply online learning algorithms that cope with non-stationary, adversarial environments and minimize regret while balancing exploration and exploitation.
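As an illustration, one round of this game can be simulated as follows (a minimal sketch with our own names, not the experiment code):

import numpy as np

N_SITES, N_FLAGS = 5, 2
rng = np.random.default_rng(42)

def play_round(hider_sites, seeker_sites):
    """Seeker's reward is the number of hidden flags she finds (0, 1, or 2)."""
    return len(set(hider_sites) & set(seeker_sites))

hider = rng.choice(N_SITES, size=N_FLAGS, replace=False)   # hide 2 flags at distinct sites
seeker = rng.choice(N_SITES, size=N_FLAGS, replace=False)  # guess 2 sites to check
reward = play_round(hider, seeker)                         # the hider's loss is this reward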

In the remainder of this paper we describe the approach taken and the performance of our strategies in two different scenarios of the above problem. Sections 3 and 4 describe the two scenarios and discuss the observations.

 

3.     PARTIAL FEEDBACK SCENARIO

Consider the following asymmetric game setting: Player 1 (the hider) can see what Player 2 has chosen each round, while Player 2 (the seeker) sees the places of the flags only if they were at the sites she chose. We play the game for 500 rounds and repeat this 100 times.

 

To ensure the flag seeker can learn and minimize his regret in the above setting, we implement two adversarial bandit algorithms: EXP3 and FPL (Follow the Perturbed Leader). Both were chosen for their simplicity and their fast, dynamic learning, and because they do not depend on full information to update the weight of each arm (in this case, each flag-hiding site). In other words, these algorithms suit a partial-feedback setting because they use only sample feedback, i.e., feedback from the arms chosen by the player, to update the arm weights. More sophisticated choices for such a scenario include the FPL-GR [2] and FPL-UE [3] algorithms, which are popular in combinatorial bandit problems.

3.1        EXP3 Hider and EXP3 Seeker

The only hyperparameter we must choose beforehand in this combination of algorithms is γ ∈ [0, 1], which tunes the desire to pick an action uniformly at random and thus determines the balance between exploration and exploitation. We study the effect of different values of γ used by both the hider and the seeker on the reward earned, looking at the trend and the convergence. This is depicted in Fig 1.

Note: when γ = 1, there is purely uniform random exploration, and the weights computed for the different sites no longer matter. Otherwise, the algorithm exploits sites with higher weights with probability 1 − γ.
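For concreteness, here is a minimal sketch of one EXP3 round in the standard formulation (our own names; rewards assumed in [0, 1]; this is not the exact code used in the experiments):

import numpy as np

def exp3_step(weights, gamma, reward_fn, rng):
    """One EXP3 round: mix the weight distribution with uniform exploration,
    sample an arm, observe only that arm's reward, and update its weight
    using the importance-weighted reward estimate."""
    K = len(weights)
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = int(rng.choice(K, p=probs))
    reward = reward_fn(arm)               # partial feedback: only the pulled arm
    estimate = reward / probs[arm]        # unbiased importance-weighted estimate
    weights[arm] *= np.exp(gamma * estimate / K)
    return arm, reward

# example: 5 sites, adversary-supplied rewards stubbed as random
rng = np.random.default_rng(1)
weights = np.ones(5)
arm, r = exp3_step(weights, gamma=0.25, reward_fn=lambda a: rng.random(), rng=rng)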

 

Observations

Hider Uses γ = 0.25:

Fig 1: Average Seeker reward plotted over the number of plays. Each plot shows the reward trend for different values of γ used by the Seeker, with a constant γ used by the Hider. The blue, green, gold, and red trends show the convergence of Seeker reward when the Seeker uses γ values 0.25, 0.5, 0.75, and 1, respectively.

The first plot from the left depicts the scenario where the Hider exploits with probability 75% and explores only 25% of the time. The blue trend shows the convergence of the reward earned by the Seeker when he explores just 25% of the time. Since the Hider is mostly exploiting, the Seeker can better predict the hiding sites and earns the highest reward even with little exploration.

As the Seeker chooses higher values of γ, his earnings tend to drop. The red trend is when the Seeker only explores; this strategy earns him the lowest average reward over all 100 games, as he fails to utilize the information he collects during exploration.

As the Hider uses higher values of γ, he explores more and seems to discover better hiding spots: the maximum reward earned by the Seeker at any γ value drops relative to the first (left-most) plot. By the time the Hider is exploring 75% of the time, the Seeker's average reward has dropped from about 0.95 to 0.85.

This also implies the Seeker is struggling to predict the Hider's moves: as the Hider opts to explore more, he randomizes his actions.

Hider Only Explores (γ = 1):

The last plot shows the reward trends when the Hider only explores and the Seeker chooses varying values of γ.

Surprisingly, no matter what γ value the Seeker chooses, the average reward remains about the same around convergence.

Also, in all plots there is evidence that whenever the Seeker is purely exploring, the average reward is the same at about 0.8 regardless of the Hider's γ value (this is the red trend in all plots). Even with partial information, the Seeker achieves an average reward of about 0.8.
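This 0.8 figure agrees with a simple calculation: whatever the Hider does, a uniformly random Seeker covers each of the 2 flags with probability 2/5 (2 guesses out of 5 sites), so by linearity of expectation

% Expected Seeker reward under uniform random guessing
% (2 flags, 2 guesses, 5 sites):
\mathbb{E}[\text{reward}]
  = \sum_{f=1}^{2} \Pr[\text{flag } f \text{ is at a guessed site}]
  = 2 \cdot \frac{2}{5} = 0.8 .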

3.2       EXP3 Hider and FPL Seeker

In this experiment, we implement the FPL strategy for the seeker. The game is played under the settings described earlier, and results are recorded in the same manner to allow comparison. The hyperparameters here are γ for the hider and η for the FPL seeker. We plot the reward trend over 100 games for different combinations of these hyperparameters in Fig 2.
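For concreteness, here is a minimal sketch of an FPL decision step, assuming the standard form in which i.i.d. exponential perturbations of scale η are added to the cumulative estimated rewards (the names and the exact noise form are our assumptions, not the experiment code):

import numpy as np

def fpl_choose(cum_rewards, eta, n_pick, rng):
    """Follow the Perturbed Leader: perturb the cumulative (estimated)
    rewards with exponential noise of scale eta, then play the n_pick
    arms with the highest perturbed totals."""
    perturbed = cum_rewards + rng.exponential(scale=eta, size=len(cum_rewards))
    return np.argsort(perturbed)[-n_pick:]   # indices of the top n_pick sites

# example: the seeker picks 2 of 5 sites given running reward estimates
rng = np.random.default_rng(0)
picks = fpl_choose(np.array([3.0, 1.5, 4.2, 0.7, 2.9]), eta=10.0, n_pick=2, rng=rng)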

Observations:

In the first three plots from the left, we observe that when the Hider purely explores, the Seeker tends to gain the least average reward for η = 1, 10, and 100. Unexpectedly, when η = 1000 and the Hider uses γ = 1 (pure exploration), the average Seeker reward is higher than the reward earned with all other γ values.

When the Seeker uses η = 1, he earns a higher average reward than he could using EXP3. Similar observations hold for η = 10 and 100 as well.

Fig 2: Average Seeker reward plotted over the number of plays. Each plot shows the reward trends for different values of γ used by the Hider, and a constant η used by the Seeker. The blue, green, gold, and red trends show the convergence of Seeker reward when Hider uses γ values: 0.25, 0.5, 0.75, and 1 respectively.

Just as with EXP3, we see that the more adventurous the Hider is, the more unpredictable the hiding sites are, and consequently the lower the Seeker's average reward. For example, when the Hider's γ is 0.25, he explores only 25% of the time and otherwise exploits sites that fetch him a higher reward; the Seeker is then better able to learn the hiding pattern, and his average reward is higher as well.

As η increases from 1 to 1000, the average reward drops from about 1 to 0.85.

Also, when η = 1000, the average reward our FPL Seeker earns is almost the same for all values of γ used by the Hider, implying that for larger perturbation scales the Seeker is more indifferent to the exploration/exploitation strategy of the Hider. The Hider is no longer able to limit the average reward earned.

4.     WHEN ADVERSARY FALSIFIES REWARDS

Now suppose that the hider can lie about the positions of the true flags during the game. For example, if the seeker chooses sites 1 and 2 while the flags are in fact at sites 3 and 4, the hider can still claim that the seeker found the flags and received a reward of 2 (the hider can also lie about 1 site rather than 2). When the hider lies, she can give any arbitrary information. To restrict the hider's power, we assume she can lie up to B times during a game. We design a faking strategy for the hider that manipulates the seeker into low performance at the end of the game (after the fake flags are revealed). We play the game with B = 10, 50, 100 for 500 rounds, repeated 100 times, and plot the average reward of the seeker (i.e., the total reward divided by the number of plays) over the number of plays, both before and after the reveal, for all three cases. We also plot the difference between the true total reward (without the fake flags) and the total reward the seeker thought she received at the end of the game.
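A minimal sketch of the per-game bookkeeping under a lie budget B follows (uniform-random players and an arbitrary stand-in lie keep the sketch self-contained; the actual experiments use EXP3 players and the lying strategy of Section 4.1):

import numpy as np

def run_game(B, n_rounds=500, seed=0):
    """Track the true vs. believed Seeker reward under a lie budget of B rounds."""
    rng = np.random.default_rng(seed)
    lies_left, true_total, believed_total = B, 0, 0
    for _ in range(n_rounds):
        hider = rng.choice(5, size=2, replace=False)    # stand-in for the EXP3 hider
        seeker = rng.choice(5, size=2, replace=False)   # stand-in for the EXP3 seeker
        true_reward = len(set(hider) & set(seeker))
        if lies_left > 0:
            lies_left -= 1
            reported = int(rng.integers(0, 3))  # arbitrary stand-in lie (see Sec. 4.1)
        else:
            reported = true_reward
        true_total += true_reward        # revealed only after the game ends
        believed_total += reported       # the Seeker learns from this value
    return true_total, believed_total

# 100 repetitions per lie budget, as in the experiments
results = {B: [run_game(B, seed=i) for i in range(100)] for B in (10, 50, 100)}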

4.1        Lying Strategy

Each game is designed as shown in Fig 3, and the lying strategy is depicted in Fig 4. It is a very simple strategy built with the intention of misleading or confusing the Seeker so that he updates his weights wrongly. The algorithmic flowchart illustrates this clearly. There are three possibilities:

Fig 3. Program flow when adversary has provision to lie about the flag positions

Seeker is right about both positions: randomly pick one of the two guesses to mislead the Seeker about, say guess 1. Let him think he is wrong: return a position vector with a flag at some other random position instead of at guess 1.

Seeker is wrong about both positions: randomly pick one of the guesses, say guess 1. Let him think he is right: return a position vector with a flag at guess 1 even though the flag is actually at some other position.

Seeker is wrong about one position: Pick the position he is wrong about. Let him think he is right.

The idea is that if we make the Seeker think he is right even though he is wrong, he will give more importance to that position, i.e., wrongly update his weights, and his future guesses will therefore be influenced to be wrong as well.
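A minimal code sketch of this faking rule follows (our own illustrative names and interface; we assume the revealed positions are a 0/1 vector over the 5 sites, which may differ from the experiment code):

import numpy as np

def fake_feedback(true_sites, guesses, rng, n_sites=5):
    """Return a falsified 0/1 flag-position vector per the three cases above.
    true_sites and guesses are each a list of 2 distinct site indices."""
    fake = [1 if s in true_sites else 0 for s in range(n_sites)]
    correct = [g for g in guesses if g in true_sites]
    wrong = [g for g in guesses if g not in true_sites]
    if len(correct) == 2:
        # Right about both: deny one random guess, plant that flag elsewhere.
        fake[int(rng.choice(correct))] = 0
        spare = [s for s in range(n_sites) if s not in guesses]
        fake[int(rng.choice(spare))] = 1
    elif len(correct) == 0:
        # Wrong about both: confirm one random guess by moving a flag there.
        fake[int(rng.choice(true_sites))] = 0
        fake[int(rng.choice(wrong))] = 1
    else:
        # Wrong about exactly one: confirm the wrong guess by moving the
        # missed flag there, so the Seeker thinks he found both.
        missed = [s for s in true_sites if s not in guesses][0]
        fake[missed] = 0
        fake[wrong[0]] = 1
    return fake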

Fig 4. Flowchart of the Hider's lying strategy

4.2        EXP3 Hider with Falsifying Powers, against an EXP3 Seeker

Fig 5: Plots show the average reward accumulated by the Hider with falsifying powers (blue), as well as the true reward (green) and the fake reward (black) accumulated by the Seeker. Plots from left to right show the convergence when the Seeker is lied to B = 10, 50, 100, and 500 times, respectively, in each game.

Observations:

When B = 10: the Hider can lie only 10 times in every game, and he makes only slightly more than he would without faking the positions.

When B is increased further: the gap in earnings between the two players widens, as the Hider can better mislead the Seeker.

When B = 500: the Hider lies in every round, and this is when the Seeker earns the least reward. This is understandable, as he is fed false information in every round. What is surprising is that, despite this, he still manages about 0.7 at convergence, where one might expect a much lower figure. One might also think that lying randomly even 10 times would be enough to make the Seeker update his weights wrongly and earn lower rewards, but it appears that lying, unless done in every round, does not really throw off the Seeker.

This is probably because the lying strategy is not sophisticated enough. Rather than randomly redirecting the Seeker to a false location, the strategy could be modified to make him look in the places he weights the least, or the places the Hider gives the least weight to. If we redirected him to look in a place where the Hider would be least likely to hide, he could be misled into making more mistakes while guessing in future rounds.

Figure 5 confirms that lying in every single round fetches the maximum reward for the Hider, while lying for just 10 rounds is about as good as not lying at all. The reward accumulated by the Seeker decreases as the lying power of the Hider increases; this gap in earnings may be widened with a more sophisticated lying strategy, as mentioned earlier.

Fig 6. Difference in reward accumulated between the players, plotted against the type of game (value of B)

5.   REFERENCES

[1] G. Neu and G. Bartók. An efficient algorithm for learning with semi-bandit feedback. ALT, pages 234–248, 2013.

[2] G. Neu and G. Bartók. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. Journal of Machine Learning Research, 17(154):1–21, 2016.

[3] H. Xu et al. Playing repeated security games with no prior knowledge. AAMAS, 2016.

[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[5] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
