R language assignment

Would you be able to do the below attached assignment which is R language data science.


MAT022 Foundations of Statistics and Data Science

Summative Assessment 2020/21

Summative assessment for the module is by means of a single report on your statistical analysis

of data related to the National Basketball Association (NBA), a men’s professional

basketball league in the USA.

This form of assessment has been chosen because as professional statisticians and data scientists,

you will often be asked to investigate a data set and report on whether it contains anything useful

or interesting. The assessment will also help you to prepare for writing your MSc dissertation

in the summer.

Assessment type Weight Max. length Format Deadline

Report 100% 10 pages R Markdown and PDF Tue 09 Feb 2021

Your report will be assessed according to how well you are able to

• analyse the data set, 40%

• interpret the results of your analysis, and 30%

• present the results of your analysis and your interpretation of the data set. 30%

Your analysis should be performed using the R statistical software package, and your report

prepared using the R Markdown typesetting system and the template provided.

1

  • The data
  • You are asked to write a report on data from the 2014–15 season of the National Basketball

    Association (NBA), a men’s professional basketball league in the USA. The data set is a record

    of all shots taken by players in the NBA between October 2014 and March 2015, and consists

    of 128,069 observations on 23 variables as described in Table 1.

    Please submit your report in PDF format, together with the R Markdown file used to generate

    the report, through Learning Central sometime before 12.00 on Tuesday 09 February 2021.


    Variable Description

    GAME_ID Unique id number of the game.

    DATE Date of the game.

    HOME_TEAM Team playing at home.

    AWAY_TEAM Team playing away from home.

    PLAYER_NAME Name of the shooting player.

    PLAYER_ID Unique id number of the shooting player.

    LOCATION Whether the player was on the home (H) or away (A) team.

    W Whether the player’s team won (W) or lost (L) the game.

    FINAL_MARGIN The margin of victory for the player’s team (negative means defeat).

    SHOT_NUMBER The number of the shot taken by the shooting player in that game.

    PERIOD The period of the game that the shot was taken.

    GAME_CLOCK The time remaining in the period when the shot was taken.

    SHOT_CLOCK The time remaining on the shot clock.

    DRIBBLES Number of dribbles by the player before the shot was taken.

    TOUCH_TIME The time that the ball was in the shooting player’s hand.

    SHOT_DIST The distance of the shooting player from the basket.

    PTS_TYPE 2 for shots from inside the arc, 3 for shots from outside the arc.

    SHOT_RESULT Whether the shot was successful (‘made’) or unsuccessful (‘missed’).

    CLOSEST_DEFENDER Name of the nearest defender when the shot was taken.

    CLOSEST_DEFENDER_ID Unique id number of the nearest defender.

    CLOSE_DEF_DIST Distance of the nearest defender when the shot was taken.

    FGM Equal to 1 if the shot was made (scored) otherwise 0.

    PTS The number of points scored (0, 2 or 3)

    Table 1: Description of the variables in the NBA shot logs data set.
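
A first step with a data set like this is often to confirm that the loaded data frame matches the description in Table 1. The sketch below is one possible way to do that; the helper `check_shot_logs` and the toy data frame are hypothetical and only illustrate the idea. It assumes the outcome column is named `SHOT_RESULT` with values `made` and `missed`, and that `PTS` should equal `FGM` times `PTS_TYPE`, as the variable descriptions suggest.

```r
# Variable names taken from Table 1 (SHOT_RESULT written with an underscore).
expected_vars <- c(
  "GAME_ID", "DATE", "HOME_TEAM", "AWAY_TEAM", "PLAYER_NAME", "PLAYER_ID",
  "LOCATION", "W", "FINAL_MARGIN", "SHOT_NUMBER", "PERIOD", "GAME_CLOCK",
  "SHOT_CLOCK", "DRIBBLES", "TOUCH_TIME", "SHOT_DIST", "PTS_TYPE",
  "SHOT_RESULT", "CLOSEST_DEFENDER", "CLOSEST_DEFENDER_ID", "CLOSE_DEF_DIST",
  "FGM", "PTS"
)

# Hypothetical helper: returns a list of simple consistency checks.
check_shot_logs <- function(shots) {
  list(
    # Variables described in Table 1 that are absent from the data frame
    missing_vars = setdiff(expected_vars, names(shots)),
    # FGM should be 1 exactly when the shot is recorded as 'made'
    fgm_matches_result = all((shots$SHOT_RESULT == "made") == (shots$FGM == 1)),
    # Points scored should be 0 for a miss, otherwise the shot type (2 or 3)
    pts_consistent = all(shots$PTS == shots$FGM * shots$PTS_TYPE)
  )
}

# Toy data frame (not real NBA data) with only the columns the checks need.
toy <- data.frame(
  SHOT_RESULT = c("made", "missed", "made"),
  FGM = c(1, 0, 1),
  PTS_TYPE = c(2, 3, 3),
  PTS = c(2, 0, 3)
)
res <- check_shot_logs(toy)
str(res)
```

With the real file, the same helper could be applied to the data frame returned by `read.csv()`; the file name is not given in the brief, so it is left to the reader.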

    ATL Atlanta Hawks

    BKN Brooklyn Nets

    BOS Boston Celtics

    CHA Charlotte Hornets

    CHI Chicago Bulls

    CLE Cleveland Cavaliers

    DAL Dallas Mavericks

    DEN Denver Nuggets

    DET Detroit Pistons

    GSW Golden State Warriors

    HOU Houston Rockets

    IND Indiana Pacers

    LAC Los Angeles Clippers

    LAL Los Angeles Lakers

    MEM Memphis Grizzlies

    MIA Miami Heat

    MIL Milwaukee Bucks

    MIN Minnesota Timberwolves

    NOP New Orleans Pelicans

    NYK New York Knicks

    OKC Oklahoma City Thunder

    ORL Orlando Magic

    PHI Philadelphia 76ers

    PHX Phoenix Suns

    POR Portland Trail Blazers

    SAC Sacramento Kings

    SAS San Antonio Spurs

    TOR Toronto Raptors

    UTA Utah Jazz

    WAS Washington Wizards

    Table 2: Acronyms for the teams in the NBA.


    2

  • The report
  • The ability to write clearly and concisely is an important professional competence. To encourage

    writing that is brief and to the point, your reports are limited to a maximum of 10 pages. It is

    often far more difficult to express yourself in 100 words than in 1000 words, especially when you

    have a lot to say, so be careful not to underestimate the challenge posed by this restriction. The

    modest page limit will also encourage you to be selective in the results you choose to present.

    A suggested structure for your report is shown in Table 3. Note that the title page, abstract,

    table of contents, list of references and appendix do not contribute towards the page count.

    Title 1 page
    Abstract 100 words
    Table of contents –

    1. Introduction 1/2 page
    2. Background 1 page
    3. (descriptive analysis) 2 pages
    4. (inferential analysis) 2 – 3 pages
    5. (inferential analysis) 2 – 3 pages
    6. (inferential analysis) 2 – 3 pages
    7. Conclusion 1/2 page

    References –
    Appendices 2 pages max.

    Table 3: Suggested report structure

    • The title page should contain the title of your report, your name and student number,
    and the date on which your report was completed.

    • The abstract should contain a short summary of the report and its main conclusions.

    • The table of contents should list the number and title of each section against the number
    of the page on which the section begins.

    • The introduction should consist of a few short paragraphs, describing the purpose of the
    report and providing a brief outline of its contents.

    • The background section should include a brief review of any relevant literature, and
    provide a context for the work presented in the report.

    • The report should contain a relatively short section on a descriptive analysis of the data
    set, with a title chosen to reflect what the section contains.

    • The main part of the report should consist of two or three sections on different inferential
    analyses of the data set. Here you should formulate hypotheses, conduct statistical tests,

    then present and discuss the results of these tests. The titles of these sections should

    reflect what the sections contain.

    • The conclusion should consist of a few short paragraphs, providing a summary of the
    report and a brief outline of some ideas for future work.

    • The report may contain a single appendix for large figures and tables, limited to a
    maximum of two pages.


    3

  • Assessment criteria
  • Detailed assessment criteria are shown in Table 4.

    Distinction (70–100)
    Analysis (40%): Hypotheses are interesting and original. Methods are appropriate and applied carefully and precisely. An interesting descriptive analysis is included and reported correctly.
    Discussion (30%): Inferences are valid and supported by evidence. Original and interesting conclusions are articulated. There is some shrewd speculation about possible causal factors.
    Presentation (30%): A high standard of writing is maintained throughout. The narrative is clear, coherent, eloquent and refined. Figures and tables are used creatively.

    Merit (60–69)
    Analysis (40%): Hypotheses are formulated correctly. Methods are appropriate and applied correctly. A moderately interesting descriptive analysis is included and reported correctly.
    Discussion (30%): Inferences are valid and supported by evidence. Interesting conclusions are articulated. There is some speculation about possible causal factors.
    Presentation (30%): A good standard of writing is maintained throughout. The narrative is clear and coherent. Figures and tables are used to illustrate the narrative.

    Pass (50–59)
    Analysis (40%): Hypotheses are formulated correctly. Methods are applied correctly for the most part. A descriptive analysis is included and reported correctly.
    Discussion (30%): Inferences are mostly valid and supported by some evidence. Some relatively interesting conclusions are articulated.
    Presentation (30%): An acceptable standard of writing is maintained throughout. The narrative is lacklustre and sometimes unclear. Figures and tables do not always illustrate the narrative.

    Fail (0–49)
    Analysis (40%): The analysis is bland and almost entirely descriptive.
    Discussion (30%): Inferences are invalid or not supported by evidence. There is little of any interest.
    Presentation (30%): The report is poorly written. The narrative is disjointed and hard to follow.

    Table 4: Assessment criteria

    Plagiarism

    You may find existing studies of the NBA data set online. Plagiarism is to present other people’s

    work or ideas as your own, by incorporating it into your work without full acknowledgement.

    The need to acknowledge others’ work applies not only to text, but also to computer code,

    figures, tables etc. You must also attribute text, data, or other resources downloaded from

    websites. Following submission your report will be analysed by the Turnitin software, and any

    report in which plagiarism is detected will receive a mark of zero.


    4

  • Guidelines for writing reports
  • The golden rule when writing is to always think of the reader. For scientific reports, readers

    will typically want to read something interesting and learn something in the process.

    What do we mean by interesting?

    Not interesting The average exam mark of statistics and data science students.

    Quite interesting The average mark of male students, the average mark of female students,

    and the results of a test of whether any difference is statistically significant.

    Very interesting The average mark of male students, the average mark of female students, a

    statistical test of whether any difference is significant, and some speculation

    about why there is a significant difference, or alternatively why there is not.
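
The difference between the three levels above can be illustrated with a small sketch on simulated data. The two groups, their sizes and their means below are invented purely for illustration; the test used is Welch's two-sample t-test, which is the default behaviour of `t.test()` in R.

```r
set.seed(1)
# Simulated exam marks for two hypothetical groups of students (not real data).
marks_a <- rnorm(40, mean = 62, sd = 10)
marks_b <- rnorm(45, mean = 60, sd = 10)

# "Not interesting": a single overall average.
overall_mean <- mean(c(marks_a, marks_b))

# "Quite interesting": the group averages, plus a test of whether the
# difference between them is statistically significant.
group_means <- c(a = mean(marks_a), b = mean(marks_b))
tt <- t.test(marks_a, marks_b)  # Welch's test: equal variances not assumed

print(overall_mean)
print(group_means)
print(tt$p.value)
```

The "very interesting" level cannot be produced by code alone: it requires a qualified interpretation of the p-value and some reasoned speculation about why a difference is, or is not, observed.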

    Audience. The target audience for your report is this year’s cohort of students on the Foundations

    of Statistics and Data Science module, so you can assume that your readers are familiar

    with the methods and terminology established within the lectures and notebooks. If you choose

    to use methods that have not been covered in lectures, you must ensure that any new terms are

    properly defined and references to the relevant literature included.

    Analysis. The reader should be satisfied that you have performed your analysis correctly,

    and in particular that you have verified the conditions that are necessary to apply the various

    methods. Your methods should be introduced with a brief summary of their main features, but

    technical details should not be discussed at length although you might consider providing the

    interested reader with references to the relevant literature.

    Navigation. Do not assume that the reader will read the report from start to finish, as one

    might read a novel. Reports should be made easy to navigate using numbered sections and

    subsections together with cross-referencing. Once you have written a first draft, it will need

    careful editing before it becomes a coherent and polished report. This stage always takes longer

    than you think!

    Scientific writing. For scientific reports we aim for a style of writing that is clear and concise.

    Make sure that sentences are unambiguous and that a good standard of writing is maintained

    throughout the report.

    • Sections should not start abruptly with the subject matter, but rather with an introductory
    sentence or short paragraph. Sections should also end with a concluding sentence or short

    paragraph.

    • All figures and tables must be numbered and have captions. Figures or tables that are not
    mentioned at least once in the text should not be included.

    • A qualified statement is one that expresses some level of uncertainty about its own accuracy,
    and should always be used when drawing conclusions from the results of a statistical

    analysis, and especially when speculating about possible causal factors. Common phrases

    that indicate qualified statements include “This suggests that …”, “It appears that …”,

    “We might conclude that …”, “There is some evidence to indicate …” and so on.



---
title: "A study of the Decathlon dataset"
author: "A Student"
email: "StudentA@cardiff.ac.uk"
date: "`r format(Sys.time(), '%d %B %Y')`"
fontsize: 11pt
fontfamily: times
geometry: margin=1in
output:
  bookdown::pdf_document2:
    toc: true
    number_sections: true
    keep_tex: true
    citation_package: natbib
    fig_caption: true
    #toc_depth: 1
    highlight: haddock
    df_print: kable
    extra_dependencies:
      caption: ["labelfont={bf}"]
#pdf_document:
#  extra_dependencies: ["flafter"]
#pdf_document2:
#  extra_dependencies: ["float"]
bibliography: [refs.bib]
biblio-style: apalike
link-citations: yes
abstract: We demonstrate how various descriptive and inferential statistical methods can be applied to the `Decathlon` dataset, and how their results might be interpreted and presented. We study the evolution of athlete performance as a function of time, and show that while the best performances appear to increase with time, the mean and median performances appear to decrease over the same period. We illustrate the non-homogeneity of the current decathlon scoring scheme, and give some insight into the profile of the best-performing decathletes. We also perform a correlation analysis to explore relationships between the different events of the decathlon, and finally present the results of a logistic-regression analysis and demonstrate that to a certain extent it is possible to distinguish between French and German decathletes by the scores they achieve on a subset of the decathlon events.
---


```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE)
```

```{r, include=FALSE}
library(corrplot)
library(Hmisc)
library(car)
library(ppcor)
library(ggpubr)
library(MASS)
library(pROC)
# read data
Decathlon <- read.csv('decathlon.csv', header=TRUE)
Nobs <- nrow(Decathlon) # number of entries
```

# Introduction {#sec:intro}

The `Decathlon` data set records the performances of elite decathletes in international competitions over the period from 1985 to 2006. The decathlon is a combined event in athletics consisting of ten track-and-field events; the current decathlon world-record holder is the French decathlete Kevin Mayer, who achieved a score of $9{,}126$ points in 2018. More information on the decathlon can be found on the following Wikipedia page: [https://en.wikipedia.org/wiki/Decathlon](https://en.wikipedia.org/wiki/Decathlon).

The data set consists of $7{,}968$ observations of $24$ variables. The names of the variables are listed below:

```{r, echo = FALSE}
noquote('Variable names:')
print(names(Decathlon))
```

An entry of the data set consists of the total number of points scored by a decathlete, the name and nationality of the decathlete, and the year the performance was achieved. The raw performances for each of the 10 events are reported (with time in seconds and distance or height in meters), together with the number of points scored for these events. There are $2{,}709$ different decathletes of $107$ different nationalities in the data set.

**Remark**. For $435$ entries of the data set, we observed a difference of $1$ between the variable `Totalpoints`, corresponding to the total number of points scored by a decathlete for that performance, and the sum of the points scored during the $10$ events. We decided to apply a correction to the corresponding entries of the variable `Totalpoints`, so that they are all equal to the sum of the points scored during the $10$ events.
```{r}
# Correction of `Totalpoints`
Decathlon$Totalpoints <- rowSums(Decathlon[,15:24])
```

In our study we pay particular attention to the total number of points, the points scored during each event, the nationality of the decathletes, and the year the performances were achieved. The Pearson correlations between the total number of points and the points scored during the various events are illustrated by the correlogram shown in Figure \@ref(fig:correlo-pts-evt), while scatter plots and histograms showing the points scored during the three throw events (shot-put, discus and javelin throw) are presented in Figure \@ref(fig:pairs-throw-evt) (see appendix).

```{r correlo-pts-evt, echo = FALSE, fig.height = 4, fig.width = 4, fig.cap = "Graphical representation of the Pearson correlations between the total number of points and the points scored during each event.", fig.align = "center"}
CorrMat <- cor(Decathlon[c(1,15:24)])
corrplot(CorrMat, method="circle")
```

# Performance across years {#sec:perf-year}

In this section, we investigate the evolution of the overall performance (variable `Totalpoints`) as a function of the year these performances were achieved (variable `yearEvent`). The number of observations for each year (or season) varies between $321$ and $399$. A graphical representation of the evolution of the best, mean and median performance as a function of the year is shown in Figure \@ref(fig:Evo-Perf-Year).

```{r Evo-Perf-Year, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Evolution of the best, mean and median performance as a function of the year. In each case, the regression line (dashed black line) is shown together with the corresponding $95\\%$ confidence intervals (purple dashed curves; prediction without noise term).", fig.align = "center"}
# Best, mean and median total score for each year from 1985 to 2006
Ind_Best_Year <- rep(0,22)
Mean_Perf_Year <- rep(0,22)
Median_Perf_Year <- rep(0,22)
for (i in 0:21){
  Ind_Best_Year[i+1] <- which.max(Decathlon$Totalpoints * as.numeric(Decathlon$yearEvent == (1985+i)))
  Mean_Perf_Year[i+1] <- mean(Decathlon$Totalpoints[Decathlon$yearEvent == (1985+i)])
  Median_Perf_Year[i+1] <- median(Decathlon$Totalpoints[Decathlon$yearEvent == (1985+i)])
}
Top_Perf_Year <- Decathlon$Totalpoints[Ind_Best_Year]
Perf_Year <- data.frame(Year=1985:2006, Top=Top_Perf_Year,
                        Mean=Mean_Perf_Year, Median=Median_Perf_Year)

# Simple linear regressions of each yearly summary on the year
mod1 <- lm(Top ~ Year, data=Perf_Year)
mod2 <- lm(Mean ~ Year, data=Perf_Year)
mod3 <- lm(Median ~ Year, data=Perf_Year)
#summary(mod3)

par(mfrow=c(1,2))
xcoord <- seq(1985, 2006, length.out=101)
newData <- data.frame(Year=xcoord)

# Left panel: season's best performance
#pred.w.plim <- predict(mod1, newData, interval = "prediction", level=0.95)
pred.w.clim <- predict(mod1, newData, interval = "confidence", level=0.95)
matplot(xcoord,
        #cbind(pred.w.clim, pred.w.plim[,-1]),
        cbind(pred.w.clim),
        lty = c(2,2,2), lwd=3,
        #col = c('black', 'green3', 'green3', 'purple', 'purple'),
        col = c('black', 'purple', 'purple'),
        type = "l", xlab = 'Year', ylab = 'Total points',
        ylim = c(min(Top_Perf_Year), max(Top_Perf_Year)))
lines(1985:2006, Top_Perf_Year, type='o', pch=19, col='red', cex=1, lwd=2)
legend('topleft', c('Best'), col=c('red'), lty=1, lwd=2, pch=19, pt.cex=1)

# Right panel: season's mean and median performance
#pred.w.plim <- predict(mod2, newData, interval = "prediction", level=0.95)
pred.w.clim <- predict(mod2, newData, interval = "confidence", level=0.95)
matplot(xcoord,
        #cbind(pred.w.clim, pred.w.plim[,-1]),
        cbind(pred.w.clim),
        lty = c(2,2,2), lwd=3,
        #col = c('black', 'green3', 'green3', 'purple', 'purple'),
        col = c('black', 'purple', 'purple'),
        type = "l", xlab = 'Year', ylab = 'Total points',
        ylim = c(min(Median_Perf_Year), max(Mean_Perf_Year)))
lines(1985:2006, Mean_Perf_Year, type='o', pch=19, col='chartreuse3', cex=1, lwd=2)
#pred.w.plim <- predict(mod3, newData, interval = "prediction", level=0.95)
pred.w.clim <- predict(mod3, newData, interval = "confidence", level=0.95)
# Axis labels and limits are inherited from the matplot() call above
matlines(xcoord,
         #cbind(pred.w.clim, pred.w.plim[,-1]),
         cbind(pred.w.clim),
         lty = c(2,2,2), lwd=3,
         #col = c('black', 'green3', 'green3', 'purple', 'purple'),
         col = c('black', 'purple', 'purple'),
         type = "l")
lines(1985:2006, Median_Perf_Year, type='o', pch=19, col='orange', cex=1, lwd=2)
legend('topright', c('Mean', 'Median'), col=c('chartreuse3', 'orange'),
       lty=1, lwd=2, pch=19, pt.cex=1)
```

## Evolution of performances {#sec:evo-perf}

Figure \@ref(fig:Evo-Perf-Year) suggests that the best season performance increases with time, while the mean and median season performances appear to decrease with time. This observation is supported by tests of *Spearman's rank correlation* between the year and the best, mean and median season performances. The results are shown in Table \@ref(tab:Evo-Perf-Corr-Test), where we observe

* for the best season performance a positive Spearman correlation of $0.398$, statistically significant at the significance level $\alpha=0.05$;
* for the mean season performance a negative Spearman correlation of $-0.551$, statistically significant at the significance level $\alpha=0.01$;
* for the median season performance a negative Spearman correlation of $-0.581$, statistically significant at the significance level $\alpha=0.01$.

```{r Evo-Perf-Corr-Test, echo=FALSE}
# One-sided Spearman tests: increasing trend for the best, decreasing for mean and median
CorrT1 <- cor.test(1985:2006, Top_Perf_Year, method='spearman', alternative='greater')
CorrT2 <- cor.test(1985:2006, Mean_Perf_Year, method='spearman', alternative='less')
CorrT3 <- cor.test(1985:2006, Median_Perf_Year, method='spearman', alternative='less', exact=FALSE)
Evo_Perf_Test_res <- data.frame(
  performance=c('Best', 'Mean', 'Median'),
  alternative=c(CorrT1$alternative, CorrT2$alternative, CorrT3$alternative),
  p.value=c(CorrT1$p.value, CorrT2$p.value, CorrT3$p.value),
  estimate=c(CorrT1$estimate, CorrT2$estimate, CorrT3$estimate))
knitr::kable(
  Evo_Perf_Test_res,
  caption = 'Result of the Spearman\'s rank correlation tests between the year and the season\'s best, mean and median performances.',
  align = 'cccc', booktabs = TRUE) %>%
  kable_styling(latex_options = "HOLD_position")
```
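
For readers less familiar with the one-sided Spearman tests used above, the following sketch applies the same call to simulated data. The yearly values are invented (a mild downward trend plus noise) and are not taken from the `Decathlon` data set. The sketch also shows that Spearman's coefficient is simply the Pearson correlation computed on the ranks.

```r
set.seed(42)
years <- 1985:2006
# Simulated yearly summary with a mild downward trend (illustrative only).
yearly_mean <- 7400 - 3 * (years - 1985) + rnorm(length(years), sd = 20)

# One-sided test of H0: rho = 0 against H1: rho < 0 (a decreasing trend).
res_less <- cor.test(years, yearly_mean, method = "spearman", alternative = "less")

# Spearman's rho is Pearson's correlation computed on the ranks.
rho_by_hand <- cor(rank(years), rank(yearly_mean))

print(res_less$estimate)
print(rho_by_hand)
print(res_less$p.value)
```

Choosing `alternative = "less"` or `"greater"` should be justified before looking at the test result, for example from a plot such as Figure \@ref(fig:Evo-Perf-Year) or from prior expectations.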

## Comparison of mean season performances {#sec:comp-season-perf}

Figure \@ref(fig:CI-Var-Mean-Year) gives an overview of the sample mean and sample variance of the performances achieved each year as represented by the `Totalpoints` variable; the corresponding confidence intervals (at the confidence level $70\%$) are also presented. We observe that 1988 has the largest sample mean and the second-largest sample variance among all years, while 1996 has the second-largest sample mean and the largest sample variance. Interestingly, 1988 and 1996 were both Olympic years, with the 1988 Games now infamous for many proven doping cases. We also observe a relatively strong overlap between the confidence intervals corresponding to these two years. The number of observations for each year is relatively large (at least $321$ observations), so by the central limit theorem the normal approximations used to compute these intervals are likely to be reasonably good.

```{r CI-Var-Mean-Year, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Sample mean and sample variance for the performances achieved each season, and corresponding confidence intervals at the confidence level $70\\%$. The horizontal intervals correspond to the means (in orange), and the vertical ones to the variances (in green).", fig.align = "center"}
# Chi-square confidence interval for a variance at confidence level 1 - alpha
VARxCIxComp <- function(Obs, alpha){
  Nsamp <- length(Obs)
  sss2 <- var(Obs)
  LowBound <- (Nsamp-1) * sss2 / qchisq(1-alpha/2, df=Nsamp-1)
  UpBound <- (Nsamp-1) * sss2 / qchisq(alpha/2, df=Nsamp-1)
  return(c(LowBound, UpBound))
}

alpha <- 0.3
Mean_TotPts_Year <- rep(0,22)
Var_TotPts_Year <- rep(0,22)
CI_mean_TotPts_Year <- matrix(0, 22, 2)
CI_var_TotPts_Year <- matrix(0, 22, 2)
Rec_numb_obs <- rep(0,22)
for (i in 0:21){
  obs <- Decathlon$Totalpoints[Decathlon$yearEvent == (1985 + i)]
  CI_mean <- t.test(obs, conf.level=1-alpha)$conf.int
  CI_mean_TotPts_Year[i+1,] <- CI_mean
  Mean_TotPts_Year[i+1] <- mean(obs)
  CI_var_TotPts_Year[i+1,] <- VARxCIxComp(obs, alpha)
  Var_TotPts_Year[i+1] <- var(obs)
}

# Empty plot, then one horizontal (mean) and one vertical (variance) interval per year
plot(c(), c(), type='p', pch=3, cex=0.5, lwd=2, col='blue',
     xlab='Mean', ylab='Variance',
     xlim=c(7275, 7460), ylim=c(130000,215000))
for (i in 1:22){
  lines(CI_mean_TotPts_Year[i,], c(1,1)*Var_TotPts_Year[i], lty=1, lwd=3, col='orange')
  lines(c(1,1)*Mean_TotPts_Year[i], CI_var_TotPts_Year[i,], lty=1, lwd=3, col='chartreuse3')
}
# Label each point with the last two digits of the year
for (i in 1:22){
  text(Mean_TotPts_Year[i]+3, Var_TotPts_Year[i],
       sprintf('%02d', (1984+i) %% 100), pos=3, cex = 1.1)
}
points(Mean_TotPts_Year, Var_TotPts_Year, type='p', pch=3, cex=0.5, lwd=2, col='blue')
```

Three different tests for homogeneity of variances return p-values between $0.052$ and $0.102$, as shown in Table \@ref(tab:TotPts-Year-Homosced). These results indicate that there is no strong statistical evidence to suggest that the variance of athlete performances differs across years: for each of the three tests we do not reject the null hypothesis of homoscedasticity at significance level $\alpha=0.05$.

```{r TotPts-Year-Homosced, echo=FALSE}
Decathlon$yearEvent_AsFactor <- factor(Decathlon$yearEvent)
Thmscd1 <- bartlett.test(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Thmscd2 <- fligner.test(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Thmscd3 <- leveneTest(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Test_Tot_Year_Homosced <- data.frame(
  Test=c('Bartlett', 'Fligner-Killeen', 'Levene'),
  p.value=c(Thmscd1$p.value, Thmscd2$p.value, Thmscd3[1,3]))
knitr::kable(
  Test_Tot_Year_Homosced,
  caption = 'Tests for homogeneity of variance for the performances (i.e. scores achieved by the decathletes) achieved each year.',
  align = 'cc', booktabs = TRUE) %>%
  kable_styling(latex_options = "HOLD_position")
```

Next we apply a one-way ANOVA to see whether there is a significant difference between the mean performance achieved each year, and find that it returns a p-value smaller than $0.0002$ (see below), which suggests that there is a statistically significant difference between the mean performance for at least two years. Notice that the number of observations for each year is roughly the same, so the data are approximately balanced. Tests indicate that the data are not normally distributed; however, the number of observations in each year group is relatively large (at least $321$), so by the central limit theorem the conclusion of this ANOVA is likely to be valid. We can also report that ANOVA performed with the `oneway.test` function (which does not assume equality of variances) returns a p-value of similar magnitude.
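
The comparison between the classical ANOVA and the `oneway.test` variant mentioned above can be sketched on simulated data. The three groups, their sizes and their spreads below are invented for illustration and do not come from the `Decathlon` data set.

```r
set.seed(7)
# Simulated scores for three hypothetical seasons with unequal spread.
sim <- data.frame(
  season = factor(rep(c("A", "B", "C"), each = 100)),
  score = c(rnorm(100, 7400, 350), rnorm(100, 7350, 450), rnorm(100, 7300, 400))
)

# Classical one-way ANOVA (assumes equal variances across groups).
p_classic <- summary(aov(score ~ season, data = sim))[[1]][["Pr(>F)"]][1]

# Welch's version: oneway.test() with var.equal = FALSE, which is its default.
p_welch <- oneway.test(score ~ season, data = sim, var.equal = FALSE)$p.value

# With var.equal = TRUE, oneway.test() reproduces the classical F-test.
p_equal <- oneway.test(score ~ season, data = sim, var.equal = TRUE)$p.value

print(c(classic = p_classic, welch = p_welch, equal_var = p_equal))
```

When the two p-values lead to the same decision, as reported above for the decathlon data, the conclusion does not appear to hinge on the equal-variance assumption.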
```{r}
AOV1 <- aov(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
print(summary(AOV1))
```

## Comparison between 1988 and 1996 {#sec:1988-1996}

In Figure \@ref(fig:Evo-Perf-Year) we observe that the sample mean of the overall performances for 1988 is the largest among all years. (A similar observation holds for the sample median, while 1988 interestingly also has the lowest season's best performance.) We now test to see whether there is a statistically significant difference between the mean performances for $1988$ and $1996$, the year having the second-largest sample mean of the overall performance.

```{r, echo=FALSE}
PvalFt <- (var.test(Decathlon$Totalpoints[Decathlon$yearEvent == 1988],
                    Decathlon$Totalpoints[Decathlon$yearEvent == 1996],
                    alternative = "two.sided", conf.level=0.95))$p.value
PvalTt <- (t.test(Decathlon$Totalpoints[Decathlon$yearEvent == 1988],
                  Decathlon$Totalpoints[Decathlon$yearEvent == 1996],
                  alternative = "two.sided", conf.level=0.95))$p.value
noquote( sprintf('F-test (comparison of variances); p-value=%.4f', PvalFt) )
noquote( sprintf('t-test (comparison of means); p-value=%.4f', PvalTt) )
```

The $F$-test for equality of variances indicates that there is no statistical evidence to suggest a difference between the variances of the two samples. The two-sample $t$-test then indicates that there is no statistical evidence to suggest a difference between the means of the two samples. Note that although the two samples might not be normally distributed, the relatively large sample sizes suggest that the normal approximation will work well here.

# Differences between events {#sec:diff-evts}

Figure \@ref(fig:Diff-Pts-Evt) illustrates the sample mean, sample median and sample variance of the number of points scored in each of the $10$ decathlon events, where we observe that the scheme for awarding points does not appear to be homogeneous.
For example, decathletes on average seem to score more points for the 110m hurdles than for the javelin event, and the variance of the points scored for the pole vault appears to be markedly larger than the variance of the points scored for the 100m.

```{r Diff-Pts-Evt, echo = FALSE, fig.height = 3.5, fig.width = 9, fig.cap = "Sample mean, median and variance of the number of points scored during each event. Values for the mean and the median are read on the left-hand-side axis, and values for the variance on the right-hand-side axis.", fig.align = "center"}
EvtNames <- colnames(Decathlon)[15:24]
VarEvt <- apply(Decathlon[,15:24], 2, var)
MedianEvt <- apply(Decathlon[,15:24], 2, median)
MeanEvt <- apply(Decathlon[,15:24], 2, mean)
par(mar=c(3,3,1,3))
CEX1 <- 2
CEX2 <- 1.8
CEX3 <- 1.2
# Mean and median on the left-hand axis
plot(1:10, MedianEvt, type='o', pch=17, lwd=3, cex=CEX2, col='orange',
     xlab='', ylab='', xaxt='n')
lines(1:10, MeanEvt, type='o', lty=2, pch=18, lwd=3, cex=CEX1, col='chartreuse3')
axis(1, at=seq(1,10,by=1), labels=EvtNames, las=0)
legend('topright', c('Mean', 'Median', 'Variance'),
       col=c('chartreuse3', 'orange', 'firebrick2'),
       lty=c(2,1,1), lwd=3, pch=c(18,17,19),
       pt.cex=c(CEX1,CEX2, CEX3), y.intersp=1.3)
# Allow a second plot on the same graph (variance on the right-hand axis)
par(new=TRUE, xpd=FALSE)
plot(1:10, VarEvt, type='o', pch=19, lwd=3, cex=CEX3, col='firebrick2',
     xlab='', ylab='', xaxt='n', yaxt='n')
axis(4, at=seq(4000,10000,by=2000), labels=seq(4000,10000,by=2000),
     col='black', col.axis='black', las=0)
```

## Difference between the median number of points scored during each event {#sec:diff-med-evt}

To test the statistical significance of the apparent non-homogeneity of the decathlon scoring scheme suggested by Figure \@ref(fig:Diff-Pts-Evt), we perform a Kruskal-Wallis rank sum test to see whether there is a significant difference between the median number of points scored during each event (see below).
The obtained p-value is extremely small, so there is strong statistical evidence to indicate that the median number of points differs for at least two events.

```{r, echo=FALSE}
# Stack the ten event-point columns into one long vector with an event label
PtsEvtNames <- colnames(Decathlon)[15:24]
Event <- c()
Pts_Event <- c()
for (i in 1:10){
  Event <- c(Event, rep(PtsEvtNames[i], Nobs))
  Pts_Event <- c(Pts_Event, unname(Decathlon[,14+i]))
}
Event <- factor(Event)
kruskal.test(Pts_Event ~ Event)
```

Next we perform a Wilcoxon signed-rank test (with continuity correction) to test whether the number of points scored for the pole-vault event is larger than for the javelin-throw event. We perform the test for *paired* observations, because the same decathlete achieved both scores. The results of the test are summarized below.

```{r, echo=FALSE}
#noquote(colnames(Decathlon)[c(20,23)])
noquote('Difference between the medians of Ppv and Pjt:')
Wil_Pv_Jt <- wilcox.test(Decathlon[,20], Decathlon[,23], alternative='greater',
                         paired=TRUE, exact=FALSE, correct=TRUE)
noquote(sprintf('wilcox.test p-value=%.4f', Wil_Pv_Jt$p.value))
```

The obtained p-value is extremely small, so there is strong statistical evidence to suggest that the median number of points scored during the pole-vault event is larger than for the javelin-throw event.
To a lesser extent, we also observe a statistically significant difference between the median number of points scored for the 100m and long-jump events, even though these two look relatively close in Figure \@ref(fig:Diff-Pts-Evt):

```{r, echo=FALSE}
#noquote(colnames(Decathlon)[c(15,16)])
noquote('Difference between the medians of P100m and Plj:')
Wil_100_Lj <- wilcox.test(Decathlon[,15], Decathlon[,16], alternative='two.sided', paired=TRUE, exact=FALSE, correct=TRUE)
noquote(sprintf('wilcox.test p-value=%.4f', Wil_100_Lj$p.value))
```

## Profile of the season best performers {#sec:profil-best}

To identify which events appear to be the most decisive in determining the overall winner, for each year we compute the in-season ranking for each event of the decathlete with the best overall performance, illustrated in Figure \@ref(fig:in-season-rank-top).

```{r in-season-rank-top, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Boxplot of the in-season rank achieved by the season's best performers for each event of the decathlon.", fig.align = "center"}
Year_Perf_rank <- matrix(0,22,10)
for (i in 0:21){
  # Rank every event within the season (rank 1 = most points), then keep the
  # first row of that season, which is taken to be the season's best performer
  Season_Rank <- (apply(-Decathlon[(Decathlon$yearEvent == (1985+i)),15:24], 2, rank))[1,]
  Year_Perf_rank[(i+1),] <- Season_Rank
}
colnames(Year_Perf_rank) <- colnames(Decathlon)[15:24]
rownames(Year_Perf_rank) <- 1985:2006
options(repr.plot.width=10, repr.plot.height=7)
boxplot(Year_Perf_rank, col='mediumorchid3', ylab='In-season ranking')
```

Although only a descriptive analysis, the rank-based analysis associated with Figure \@ref(fig:in-season-rank-top) seems to highlight some interesting features:

* The event appearing as the least decisive is the 1500m (in view of Figure \@ref(fig:in-season-rank-top)). In our opinion, this can be at least partially explained as follows. The 1500m is the last of the $10$ events, occurring at the end of the second day, so only the decathletes in a close fight for victory or trying to beat their personal record have an incentive to perform well during this event (and they are certainly all tired after two days of competition). The 1500m is also the only pure-endurance event (the other event including an endurance component is the 400m), so there is no real incentive for a decathlete to specialise in this event. This observation agrees with the fact that the 1500m appears to be the event that is the least correlated with the others and with the total score achieved by a decathlete; see Figure \@ref(fig:correlo-pts-evt).
* The event appearing as the most decisive is the 110m hurdles; in the data, the best season performer achieved the best or second-best season performance at the 110m hurdles in $50\%$ of the cases. This might be explained by the fact that the 110m hurdles requires very good explosive power and speed, combined with strong technique and excellent coordination; the risk of falling during the race is also very high in comparison to the other track events.
* The second, third and fourth most decisive events appear to be the 100m, the long jump and the 400m, respectively. Although it is a jump event, the qualities needed to perform well at the long jump are very similar to the qualities required to be a good sprinter (the correlation between the scores achieved during the long jump and the 100m is actually relatively strong, at $0.49$).

These observations suggest that top-performing decathletes excel in the "speed-related" events, and perform relatively well in the other events except for the 1500m.
## Partial correlation between events {#sec:partial-corr}

To further investigate the relationships between the different events of the decathlon, we perform a partial correlation analysis by computing the partial correlations between the points scored for every pair of events while controlling for all the other events. The computed partial correlations are illustrated in Figure \@ref(fig:partial-correlo-pts-evt).

```{r partial-correlo-pts-evt, echo = FALSE, fig.height = 4, fig.width = 4, fig.cap = "Graphical representation of the partial correlations between the points scored during every pair of events while controlling for all the other events.", fig.align = "center" }
# Matrix of partial correlations between the points of the 10 events
AllPcor <- pcor(as.matrix(Decathlon[,c(15:24)]) )$estimate
corrplot(AllPcor, method="circle")
```

Interestingly, and in comparison to the correlations shown in Figure \@ref(fig:correlo-pts-evt), if we control for all the other events, we observe only a few relatively strong partial correlations between pairs of events (all of which are significant at $\alpha=0.01$). In particular we observe that:

* `P100m` and `P1500` are negatively correlated,
* `P400m` and `P1500` are positively correlated,
* `P400m` and `P100m` are positively correlated,
* `Psp` and `Pdt` are positively correlated.

The positive partial correlations for the pairs `P400m-P100m` and `P400m-P1500`, and the negative partial correlation for the pair `P1500-P100m`, might be a consequence of the fact that decathletes performing well at the 400m may either be very fast or have good endurance. The relatively strong partial correlation between `Psp` and `Pdt` is also not surprising, because the shot put and discus throw require very similar physical abilities and technical skills.
# Differences between French and German decathletes {#sec:french-german}

In this section, we explore the possibility of differentiating between French and German decathletes by using a logistic regression model based on the points scored during a selection of events. To this end, we extract the entries of the data set corresponding to the points scored during the 10 events by the French and German decathletes, and gather these entries in a data set named `data_FRAxGER`; a categorical variable `IsFrench` is added to this data set, taking the value $1$ when the entry corresponds to a French decathlete, and $0$ for a German decathlete. The resulting data set has $1{,}297$ entries, with $40.17\%$ corresponding to French decathletes. Notice that no decathlete appears more than $12$ times in the extracted data set.

```{r, echo=FALSE}
# Row indices of the French and German entries
Ind_FRA <- (1:Nobs)[Decathlon$Nationality == 'FRA']
Ind_GER <- (1:Nobs)[Decathlon$Nationality == 'GER']
Ind_FRAxGER <- c(Ind_FRA,Ind_GER)
n_FRA <- length(Ind_FRA)
n_GER <- length(Ind_GER)
n_FRAxGER <- n_FRA + n_GER
noquote(sprintf('Number of French entries: %d', n_FRA))
noquote(sprintf('Number of German entries: %d', n_GER))
# Indicator variable: the French entries come first in Ind_FRAxGER
IsFrench <- rep(0,n_FRAxGER)
IsFrench[1:n_FRA] <- 1
IsFrench <- factor(IsFrench)
Diff_FRA <- length(unique(Decathlon$DecathleteName[Ind_FRA]))
Diff_GER <- length(unique(Decathlon$DecathleteName[Ind_GER]))
noquote(sprintf('Number of different French decathletes: %d', Diff_FRA))
noquote(sprintf('Number of different German decathletes: %d', Diff_GER))
data_FRAxGER <- Decathlon[Ind_FRAxGER, 15:24]
data_FRAxGER$IsFrench <- IsFrench
```

To select a set of relevant events, we use the `stepAIC` function of the `MASS` package. The stepwise procedure is initialised from the logistic model based on the points scored during the 10 events. The resulting model, named `step.model`, depends on $7$ events, as described below (where the coefficients of the resulting model are given).
```{r, echo=FALSE}
# Full logistic model on the 10 events, then stepwise selection by AIC
full.model <- glm(IsFrench ~ Plj + Pdt + P400m + Ppv + Pjt + Phj + P110h + P1500 + Psp + P100m, family=binomial(link='logit'), data=data_FRAxGER)
step.model <- full.model %>% stepAIC(trace = FALSE)
#summary(step.model)
noquote('Coefficients of step.model:')
print(step.model$coefficients[1:5])
print(step.model$coefficients[6:8])
```
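The stepwise selection used above can be tried out on simulated data. The following is only a sketch under stated assumptions: it uses base R's `step()` (a close relative of `MASS::stepAIC`) rather than the report's exact call, and the data frame `d`, its values and the coefficients used to simulate the labels are invented for illustration.

```r
set.seed(5)
n <- 400

# Simulated event points (invented values; only the column names mimic the data set)
d <- data.frame(P100m = rnorm(n, 800, 60),
                Psp   = rnorm(n, 700, 70),
                Phj   = rnorm(n, 750, 65))

# By construction, only P100m and Psp influence the simulated class label
eta <- 0.02 * (d$P100m - 800) - 0.02 * (d$Psp - 700)
d$IsFrench <- factor(rbinom(n, 1, plogis(eta)))

# Full logistic model, then backward stepwise selection by AIC
full <- glm(IsFrench ~ P100m + Psp + Phj,
            family = binomial(link = "logit"), data = d)
sel <- step(full, trace = 0)

# Coefficients of the selected model and AIC before/after selection
print(coef(sel))
print(c(full = AIC(full), selected = AIC(sel)))
```

A positive coefficient raises the fitted probability of the class coded `1`, which is how the signs of the coefficients of `step.model` are read in the text.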
The ROC curve corresponding to the model `step.model` is given in Figure \@ref(fig:ROC-curve). Although not especially impressive, the model appears to be able to identify some statistically significant differences between the French and German decathletes. Based on the coefficients of `step.model`, French decathletes appear, to a certain extent, to be better than their German counterparts in the discus throw, pole vault, high jump and 100m (positive coefficients), while German decathletes on average seem to perform better in the 400m, 110m hurdles and shot put (negative coefficients).
```{r ROC-curve, echo = FALSE, fig.height = 3, fig.width = 3, fig.cap = "ROC curve for the logistic regression model \\texttt{step.model}; the orange dashed lines correspond to the sensitivity and specificity of the resulting binary classifier for a decision threshold at $40\\%$.", fig.align = "center"}
# Confusion matrix (rows: predicted class, columns: observed class),
# with the classification error of each observed class
confusion.glm <- function(data, model, tresh){
  prediction <- ifelse(predict(model, data, type='response') > tresh, TRUE, FALSE)
  confusion <- table(prediction, as.logical(model$y))
  confusion <- cbind(confusion, c(1 - confusion[1,1]/(confusion[1,1]+confusion[2,1]), 1 - confusion[2,2]/(confusion[2,2]+confusion[1,2])))
  confusion <- as.data.frame(confusion)
  names(confusion) <- c('FALSE', 'TRUE', 'class.error')
  confusion
}
prob <- predict(step.model, type='response')
data_FRAxGER$prob <- prob
ROC_curve <- roc(IsFrench ~ prob, data=data_FRAxGER, levels=c(0,1), direction='<')
options(repr.plot.width=7, repr.plot.height=7)
plot(ROC_curve)
# Confusion matrix at a given decision threshold
Confu_T <- confusion.glm(data_FRAxGER, step.model, 0.4)
SensitivityT <- Confu_T[2,2]/sum(Confu_T[,2])
SpecificityT <- Confu_T[1,1]/sum(Confu_T[,1])
abline(v=SpecificityT, col='orange', lwd=2, lty=2)
abline(h=SensitivityT , col='orange', lwd=2, lty=2)
```

With a decision threshold set at $40\%$, the binary classifier based on the logistic model `step.model` correctly classifies $69.29\%$ of the French decathletes, and $67.27\%$ of the German decathletes, as reported below (see also Figure \@ref(fig:ROC-curve)). Note that the threshold was set at $40\%$ to obtain balanced percentages of classification errors.

```{r, echo=FALSE}
noquote('Confusion matrix for step.model with 40% decision threshold:')
print(Confu_T)
```

Note also that in view of their Z-values, the long jump and the 110m hurdles events appear less influential than the other variables. We therefore removed these two predictors from the model; however, the conclusions drawn from an analysis of the reduced model are similar to the conclusions based on `step.model`.

```{r, echo=FALSE}
noquote('p-values for the Z-values of step.model:')
print(coef(summary(step.model))[,4][1:5])
print(coef(summary(step.model))[,4][6:8])
```

# Conclusion

In this report, we have used various statistical tools to explore the `Decathlon` data set.
We performed some *correlation analyses*, both parametric and non-parametric, and some *tests on means and variances* using the $t$-test, $F$-test and ANOVA, together with various tests for checking conditions. We also performed some *non-parametric tests* for the median of certain quantities, including the Wilcoxon and Kruskal-Wallis rank-sum tests, and computed *confidence intervals* for certain means and variances. In addition, we computed and discussed some *linear regression* and *logistic regression* models, and used various *graphical representation tools* to illustrate the data and our findings. For the parametric tests, sample sizes were generally large enough to ensure the *validity of the normal approximation framework*.

Our main observations can be summarised as follows:

1. the best season performances appear to increase with the years, while the mean and median performances appear to decrease over the same period (Section \@ref(sec:perf-year));
2. the year 1988 is relatively "special", but not in a statistically significant way (Section \@ref(sec:perf-year));
3. the current scoring scheme of the decathlon is not homogeneous (Section \@ref(sec:diff-evts));
4. the best decathletes seem, in general, to be those who outperform the others in "pure-speed" events, including the long jump (Section \@ref(sec:diff-evts));
5. there exist relatively strong partial correlations between groups of events (Section \@ref(sec:diff-evts));
6. certain countries seem, to a certain extent, to yield decathletes with specific profiles (Section \@ref(sec:french-german)).

These observations are in our opinion interesting and appear to be statistically significant. Our conclusions should nevertheless be checked and refined by further analyses, potentially using other sources of data.
# Appendix

```{r pairs-throw-evt, echo = FALSE, fig.height = 6, fig.width = 6, fig.cap = "Scatter plots, histograms and correlations for the three throw events.", fig.align = "center" }
# Diagonal panels: histogram rescaled to unit height
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
# Upper panels: Pearson correlation, with text size proportional to its magnitude
panel.cor <- function(x, y, ...) {
  par(usr = c(0, 1, 0, 1))
  txt <- as.character(format(cor(x, y), digits=2))
  text(0.5, 0.5, txt, cex = 6* abs(cor(x, y)))
}
options(repr.plot.width=8, repr.plot.height=8)
pairs(Decathlon[c(17,22,23)], main = "Throw events", pch = 21, bg = 'coral2', upper.panel=panel.cor, diag.panel=panel.hist)
```

    A study of the Decathlon dataset

    A Student

    20 February 2021

    Abstract

    We demonstrate how various descriptive and inferential statistical methods can be applied to the
    Decathlon dataset, and how their results might be interpreted and presented. We study the evolution of
    athlete performance as a function of time, and show that while the best performances appear to increase
    with time, the mean and median performances appear to decrease over the same period. We illustrate
    the non-homogeneity of the current decathlon scoring scheme, and give some insight into the profile of
    the best-performing decathletes. We also perform a correlation analysis to explore relationships between
    the different events of the decathlon, and finally present the results of a logistic-regression analysis and
    demonstrate that to a certain extent it is possible to distinguish between French and German decathletes
    by the scores they achieve on a subset of the decathlon events.

    Contents

    1 Introduction
    2 Performance across years
      2.1 Evolution of performances
      2.2 Comparison of mean season performances
      2.3 Comparison between 1988 and 1996
    3 Differences between events
      3.1 Difference between the median number of points scored during each event
      3.2 Profile of the season best performers
      3.3 Partial correlation between events
    4 Differences between French and German decathletes
    5 Conclusion
    6 Appendix

    1 Introduction

    The Decathlon data set records the performances of elite decathletes in international competitions over
    the period from 1985 to 2006. The decathlon is a combined event in athletics consisting of ten track-and-
    field events; the current decathlon world-record holder is the French decathlete Kevin Mayer, who achieved
    a score of 9,126 points in 2018. One may refer to the following Wikipedia page for more information on the
    decathlon: https://en.wikipedia.org/wiki/Decathlon.

    The data set consists of 7,968 observations of 24 variables. The names of the variables are listed below:

    ## [1] Variables names:

    ##  [1] "Totalpoints"    "DecathleteName" "Nationality"    "m100"
    ##  [5] "Longjump"       "Shotput"        "Highjump"       "m400"
    ##  [9] "m110hurdles"    "Discus"         "Polevault"      "Javelin"
    ## [13] "m1500"          "yearEvent"      "P100m"          "Plj"
    ## [17] "Psp"            "Phj"            "P400m"          "P110h"
    ## [21] "Ppv"            "Pdt"            "Pjt"            "P1500"

    An entry of the data set consists of the total number of points scored by a decathlete, the name and nationality
    of the decathlete, and the year the performance was achieved. The raw performances for each of the 10
    events are reported (with time in seconds and distance or height in meters), together with the number of
    points scored for these events. There are 2,709 different decathletes of 107 different nationalities in the data
    set.

    Remark. For 435 entries of the data set, we observed a difference of 1 between the variable Totalpoints,
    corresponding to the total number of points scored by a decathlete for that performance, and the sum of the
    points scored during the 10 events. We decided to apply a correction to the corresponding entries of the
    variable Totalpoints, so that they are all equal to the sum of the points scored during the 10 events.

    # Correction of `Totalpoints`
    Decathlon$Totalpoints <- rowSums(Decathlon[,15:24])
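
The size of the discrepancy described in the remark can be tabulated before `Totalpoints` is overwritten. The following is a minimal sketch on a toy data frame: the data frame `toy` and its values are invented for illustration, and only the column names follow the data set.

```r
# Toy stand-in for the Decathlon data (values are invented for illustration)
toy <- data.frame(
  Totalpoints = c(1500, 1601, 1450),
  P100m       = c(800, 850, 700),
  Plj         = c(700, 750, 750)
)

# Difference between the recorded total and the sum of the event points
event_cols  <- c("P100m", "Plj")
discrepancy <- toy$Totalpoints - rowSums(toy[, event_cols])

# Tabulate the discrepancies, then apply the same correction as in the report
print(table(discrepancy))
toy$Totalpoints <- rowSums(toy[, event_cols])
```

On the real data the same two steps would use columns 15 to 24 in place of `event_cols`.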

    In our study we pay particular attention to the total number of points, the points scored during each event,
    the nationality of the decathletes, and the year the performances were achieved. The Pearson correlations
    between the total number of points and the points scored during the various events are illustrated by the
    correlogram shown in Figure 1, while scatter plots and histograms showing the points scored during the
    three throw events (shot put, discus and javelin throw) are presented in Figure 8 (see appendix).

    2 Performance across years

    In this section, we investigate the evolution of the overall performance (variable Totalpoints) as a
    function of the year these performances were achieved (variable yearEvent). The number of observations for
    each year (or season) varies between 321 and 399. A graphical representation of the evolution of the best,
    mean and median performance as a function of the year is shown in Figure 2.


    [Figure 1 image: correlogram with rows and columns Totalpoints, P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt and P1500; colour scale from −1 to 1]

    Figure 1: Graphical representation of the Pearson correlations between the total number of points and the
    points scored during each event.

    [Figure 2 image: two panels of Total points against Year (1985 to 2005); one panel shows the Best performance, the other the Mean and Median performances]

    Figure 2: Evolution of the best, mean and median performance as a function of the year. In each case, the
    regression line (dashed black line) is shown together with the corresponding 95% confidence intervals (purple
    dashed curves; prediction without noise term).


    2.1 Evolution of performances

    Figure 2 suggests that the best season performance increases with time, while the mean and median season
    performances appear to decrease with time. This observation is confirmed by tests of Spearman's rank
    correlation between the year and the best, mean and median season performances. The results are
    shown in Table 1, where we observe

    • for the best season performance a positive Spearman correlation of 0.398, statistically significant at
    the significance level α = 0.05;

    • for the mean season performance a negative Spearman correlation of −0.551, statistically significant
    at the significance level α = 0.01;

    • for the median season performance a negative Spearman correlation of −0.581, statistically significant
    at the significance level α = 0.01.

    Table 1: Result of the Spearman’s rank correlation tests between the year and the season’s best, mean and
    median performances.

    performance  alternative  p.value    estimate
    Best         greater      0.0337857   0.3980802
    Mean         less         0.0044491  -0.5505364
    Median       less         0.0022965  -0.5807911
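
One-sided Spearman tests of the kind summarised in Table 1 can be run with `cor.test` from base R. The sketch below uses invented yearly values (`year` and `best` are toy vectors, not taken from the data set) and collects the same three quantities as the table.

```r
# Toy yearly summary (invented values): season's best score for ten seasons
year <- 1985:1994
best <- c(8700, 8650, 8720, 8690, 8760, 8740, 8800, 8790, 8850, 8830)

# One-sided test of Spearman's rank correlation (H1: positive association)
res <- cor.test(year, best, method = "spearman", alternative = "greater")

# Collect the quantities reported in Table 1
print(data.frame(alternative = res$alternative,
                 p.value     = res$p.value,
                 estimate    = unname(res$estimate)))
```

For the mean and median rows, the same call would presumably be used with `alternative = "less"`.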

    2.2 Comparison of mean season performances

    Figure 3 gives an overview of the sample mean and sample variance of the performances achieved each year
    as represented by the Totalpoints variable; the corresponding confidence intervals (at the confidence
    level 70%) are also presented. We observe that 1988 has the largest sample mean and the second-largest
    sample variance among all years, while 1996 has the second-largest sample mean and the largest sample
    variance. Interestingly, 1988 and 1996 were both Olympic years, with the 1988 Games now infamous for
    many proven doping cases. We also observe a relatively strong overlap between the confidence intervals
    corresponding to these two years. The number of observations for each year is relatively large (at least 321
    observations), so by the central limit theorem the normal approximations used to compute these intervals
    are likely to be reasonably good.

    Three different tests for homogeneity of variances return p-values between 0.052 and 0.102, as shown in
    Table 2. These results indicate that there is no strong statistical evidence to suggest that the variance of
    athlete performances differs across years: for each of the three tests we do not reject the null hypothesis of
    homoscedasticity at significance level α = 0.05.

    Table 2: Tests for homogeneity of variance for the performances (i.e. scores achieved by the decathletes)
    achieved each year.

    Test             p.value
    Bartlett         0.0521832
    Fligner-Killeen  0.0575033
    Levene           0.1015971
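
Two of the three tests in Table 2 are available in base R. The following sketch runs them on simulated scores (the data frame `scores` is invented, with equal variances by construction); Levene's test is not in base R and is left out here, although `leveneTest` in the `car` package provides it.

```r
set.seed(1)

# Simulated scores for three seasons (invented; equal variances by construction)
scores <- data.frame(
  points = rnorm(300, mean = 7300, sd = 400),
  year   = factor(rep(c(1985, 1986, 1987), each = 100))
)

# Bartlett's test assumes normality; Fligner-Killeen is rank-based and more robust
bart <- bartlett.test(points ~ year, data = scores)
flig <- fligner.test(points ~ year, data = scores)

print(data.frame(Test    = c("Bartlett", "Fligner-Killeen"),
                 p.value = c(bart$p.value, flig$p.value)))
```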

    [Figure 3 image: sample Variance (vertical axis) against sample Mean (horizontal axis) of Totalpoints, one point per season labelled 85 to 06]

    Figure 3: Sample mean and sample variance for the performances achieved each season, and corresponding
    confidence intervals at the confidence level 70%. The horizontal intervals correspond to the means (in
    orange), and the vertical ones to the variances (in green).

    Next we apply a one-way ANOVA to see whether there is a significant difference between the mean
    performance achieved each year, and find that it returns a p-value smaller than 0.0002 (see below), which suggests
    that there is a statistically significant difference between the mean performance for at least two years. Notice
    that the number of observations for each year is roughly the same, so the data are approximately balanced.
    Tests indicate that the data are not normally distributed; however, the number of observations in each year
    group is relatively large (at least 321), so by the central limit theorem the conclusion of this ANOVA
    is likely to be valid. We can also report that ANOVA performed with the oneway.test function (which
    does not assume equality of variances) returns a p-value of similar magnitude.

    AOV1 <- aov(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
    print(summary(AOV1))

    ##                      Df    Sum Sq Mean Sq F value   Pr(>F)
    ## yearEvent_AsFactor   21 8.894e+06  423547   2.487 0.000184 ***
    ## Residuals          7946 1.353e+09  170302
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
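
The comparison between the classical ANOVA and `oneway.test` mentioned above can be sketched as follows. The data are simulated (the data frame `scores`, its group means and its standard deviations are invented), so the p-values printed are not those of the report.

```r
set.seed(2)

# Simulated scores for three seasons with unequal variances (invented values)
scores <- data.frame(
  points = c(rnorm(100, 7300, 400), rnorm(100, 7350, 450), rnorm(100, 7450, 380)),
  year   = factor(rep(c(1985, 1986, 1987), each = 100))
)

# Classical one-way ANOVA (assumes equal variances across groups)
fit   <- aov(points ~ year, data = scores)
p_aov <- summary(fit)[[1]][["Pr(>F)"]][1]

# Welch's one-way test (does not assume equal variances)
welch <- oneway.test(points ~ year, data = scores, var.equal = FALSE)

print(c(aov = p_aov, welch = welch$p.value))
```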

    2.3 Comparison between 1988 and 1996

    In Figure 2 we observe that the sample mean of the overall performances for 1988 is the largest among all
    years. (A similar observation holds for the sample median, while 1988 interestingly also has the lowest
    season's best performance.) We now test to see whether there is a statistically significant difference between
    the mean performances for 1988 and 1996, the year having the second-largest sample mean of the overall
    performance.

    ## [1] F-test (comparison of variances); p-value=0.7034

    ## [1] t-test (comparison of means); p-value=0.4546

    The F-test for equality of variances indicates that there is no statistical evidence to suggest a difference
    between the variances of the two samples. The two-sample t-test then indicates that there is no statistical
    evidence to suggest a difference between the means of the two samples. Note that although the two samples
    might not be normally distributed, the relatively large sample sizes mean that the normal approximation
    should work well here.
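
The two-step comparison described above (variances first, then means) can be sketched with `var.test` and `t.test`. The samples `y1988` and `y1996` below are simulated, and the choice `var.equal = TRUE` is an assumption suggested by the outcome of the F-test; the chunk that produced the reported p-values is not shown in the report.

```r
set.seed(3)

# Simulated total scores for two seasons (invented values)
y1988 <- rnorm(350, mean = 7450, sd = 420)
y1996 <- rnorm(350, mean = 7430, sd = 430)

# Step 1: F-test for equality of the two variances
f_res <- var.test(y1988, y1996)

# Step 2: two-sample t-test for equality of the two means
# (pooled variance, assumed here because the F-test does not reject)
t_res <- t.test(y1988, y1996, var.equal = TRUE)

print(noquote(sprintf("F-test (comparison of variances); p-value=%.4f", f_res$p.value)))
print(noquote(sprintf("t-test (comparison of means); p-value=%.4f", t_res$p.value)))
```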

    3 Differences between events

    Figure 4 illustrates the sample mean, sample median and sample variance of the number of points scored
    in each of the 10 decathlon events, where we observe that the scheme for awarding points does not appear
    to be homogeneous. For example, decathletes on average seem to score more points for the 110m hurdles
    than for the javelin event, and the variance of the points scored for the pole vault appears to be significantly
    larger than the variance of the points scored for the 100m.

    [Figure 4 image: Mean and Median (left-hand-side axis) and Variance (right-hand-side axis) of the points scored for P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt and P1500]

    Figure 4: Sample mean, median and variance of the number of points scored during each event. Values for
    the mean and the median are read on the left-hand-side axis, and values for the variance on the right-hand-side axis.

    3.1 Difference between the median number of points scored during each event

    To test the statistical significance of the apparent non-homogeneity of the decathlon scoring scheme
    suggested by Figure 4, we perform a Kruskal-Wallis rank sum test to see whether there is a significant difference
    between the median number of points scored during each event (see below). The obtained p-value is
    extremely small, so there is strong statistical evidence to indicate that the median number of points differs for
    at least two events.


    ##
    ## Kruskal-Wallis rank sum test
    ##
    ## data: Pts_Event by Event
    ## Kruskal-Wallis chi-squared = 34243, df = 9, p-value < 2.2e-16
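
The Kruskal-Wallis test needs the points in "long" form: one column of values and one factor identifying the event. A compact way of obtaining this layout is base R's `stack()`, sketched here on a toy data frame (`toy` and its values are invented for illustration).

```r
# Toy points for three events and four performances (invented values)
toy <- data.frame(P100m = c(820, 790, 805, 840),
                  Plj   = c(760, 800, 770, 815),
                  Pjt   = c(600, 640, 580, 655))

# stack() returns a data frame with columns `values` and `ind` (the event)
long <- stack(toy)

# Kruskal-Wallis rank sum test of the points by event
kw <- kruskal.test(values ~ ind, data = long)
print(kw)
```

The degrees of freedom equal the number of events minus one, which matches `df = 9` for the 10 events in the output above.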

    Next we perform a Wilcoxon signed-rank test (with continuity correction) to test whether the number of points
    scored for the pole-vault event is larger than for the javelin-throw event. We perform the test for paired
    observations, because the same decathlete achieved both scores. The results of the test are summarized
    below.

    ## [1] Difference between the medians of Ppv and Pjt:

    ## [1] wilcox.test p-value=0.0000

    The obtained p-value is extremely small, so there is strong statistical evidence to suggest that the median
    number of points scored during the pole-vault event is larger than for the javelin-throw event. To a lesser
    extent, we also observe a statistically significant difference between the median number of points scored for
    the 100m and long-jump events, even though these two look relatively close in Figure 4:

    ## [1] Difference between the medians of P100m and Plj:

    ## [1] wilcox.test p-value=0.0226
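
With `paired = TRUE`, `wilcox.test` works on the within-athlete differences. The sketch below uses invented scores for eight performances (the vectors `ppv` and `pjt` are toy data) and mirrors the arguments used in the report.

```r
# Invented pole-vault and javelin points for eight performances
ppv <- c(790, 820, 760, 850, 805, 910, 731, 880)
pjt <- c(700, 745, 742, 760, 695, 801, 720, 777)

# Paired one-sided test (H1: pole-vault points tend to exceed javelin points),
# using the normal approximation with continuity correction
res <- wilcox.test(ppv, pjt, alternative = "greater",
                   paired = TRUE, exact = FALSE, correct = TRUE)

# V is the sum of the ranks of the positive differences
print(res$statistic)
print(noquote(sprintf("wilcox.test p-value=%.4f", res$p.value)))
```

Here every difference is positive, so the statistic takes its maximum value of 8 × 9 / 2 = 36.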

    3.2 Profile of the season best performers

    To identify which events appear to be the most decisive in determining the overall winner, for each year we
    compute the in-season ranking for each event of the decathlete with the best overall performance, illustrated
    in Figure 5.

    Although only a descriptive analysis, the rank-based analysis associated with Figure 5 seems to highlight
    some interesting features:

    • The event appearing as the least decisive is the 1500m (in view of Figure 5). In our opinion, this can be
    at least partially explained as follows. The 1500m is the last of the 10 events, occurring
    at the end of the second day, so only the decathletes in a close fight for victory or trying to beat their
    personal record have an incentive to perform well during this event (and they are certainly all
    tired after two days of competition). The 1500m is also the only pure-endurance event (the other
    event including an endurance component is the 400m), so there is no real incentive for a decathlete
    to specialise in this event. This observation agrees with the fact that the 1500m appears
    to be the event that is the least correlated with the others and with the total score achieved by a
    decathlete; see Figure 1.

    • The event appearing as the most decisive is the 110m hurdles; in the data, the best season performer
    achieved the best or second-best season performance at the 110m hurdles in 50% of the cases. This
    might be explained by the fact that the 110m hurdles requires very good explosive power and speed,
    combined with strong technique and excellent coordination; the risk of falling during the race is
    also very high in comparison to the other track events.

    [Figure 5 image: boxplots of the In-season ranking (vertical axis) for P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt and P1500]

    Figure 5: Boxplot of the in-season rank achieved by the season's best performers for each event of the
    decathlon.

    • The second, third and fourth most decisive events appear to be the 100m, the long jump and the
    400m, respectively. Although it is a jump event, the qualities needed to perform well at the long jump are
    very similar to the qualities required to be a good sprinter (the correlation between the scores achieved
    during the long jump and the 100m is actually relatively strong, at 0.49).

    These observations suggest that top-performing decathletes excel in the "speed-related" events, and perform
    relatively well in the other events except for the 1500m.
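
The in-season ranks behind Figure 5 can be sketched on a toy data set as follows. The data frame `toy` is invented, and the best performer of each season is located with `which.max` on the total, which is an assumption of this sketch rather than the report's own indexing.

```r
# Toy data: two seasons, three performances each, two events (invented values)
toy <- data.frame(
  yearEvent = c(1985, 1985, 1985, 1986, 1986, 1986),
  P100m     = c(900, 950, 800, 870, 860, 910),
  P1500     = c(600, 700, 650, 720, 640, 700)
)
events <- c("P100m", "P1500")
toy$Totalpoints <- rowSums(toy[, events])

# For each season: rank every event (rank 1 = most points), then keep the
# ranks of the performance with the highest total
best_ranks <- t(sapply(split(toy, toy$yearEvent), function(season) {
  ranks <- apply(-season[, events], 2, rank)
  ranks[which.max(season$Totalpoints), ]
}))

print(best_ranks)
```

Each row of `best_ranks` is one season; a boxplot of its columns gives a figure of the same kind as Figure 5.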

    3.3 Partial correlation between events

    To further investigate the relationships between the different events of the decathlon, we perform a partial
    correlation analysis by computing the partial correlations between the points scored for every pair of events
    while controlling for all the other events. The computed partial correlations are illustrated in Figure 6.

    Interestingly, and in comparison to the correlations shown in Figure 1, if we control for all the other events,
    we observe only a few relatively strong partial correlations between pairs of events (all of which are
    significant at α = 0.01). In particular we observe that:

    • P100m and P1500 are negatively correlated,
    • P400m and P1500 are positively correlated,
    • P400m and P100m are positively correlated,
    • Psp and Pdt are positively correlated.

    The positive partial correlations for the pairs P400m-P100m and P400m-P1500, and the negative partial
    correlation for the pair P1500-P100m, might be a consequence of the fact that decathletes performing well
    at the 400m may either be very fast or have good endurance. The relatively strong partial correlation between
    Psp and Pdt is also not surprising, because the shot put and discus throw require very similar physical
    abilities and technical skills.

    [Figure 6 image: correlogram with rows and columns P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt and P1500; colour scale from −1 to 1]

    Figure 6: Graphical representation of the partial correlations between the points scored during every pair
    of events while controlling for all the other events.

    4 Differences between French and German decathletes

    In this section, we explore the possibility of differentiating between French and German decathletes by
    using a logistic regression model based on the points scored during a selection of events. To this end, we
    extract the entries of the data set corresponding to the points scored during the 10 events by the French and
    German decathletes, and gather these entries in a data set named data_FRAxGER; a categorical variable
    IsFrench is added to this data set, taking the value 1 when the entry corresponds to a French decathlete,
    and 0 for a German decathlete. The resulting data set has 1,297 entries, with 40.17% corresponding to
    French decathletes. Notice that no decathlete appears more than 12 times in the extracted data set.

    ## [1] Number of French entries: 521

    ## [1] Number of German entries: 776

    ## [1] Number of different French decathletes: 151

    ## [1] Number of different German decathletes: 247

    To select a set of relevant events, we use the stepAIC function of the MASS package. The stepwise
    procedure is initialised from the logistic model based on the points scored during the 10 events. The
    resulting model, named step.model, depends on 7 events, as described below (where the coefficients of the
    resulting model are given).

    ## [1] Coefficients of step.model:

    ## (Intercept) Pdt P400m Ppv Phj
    ## 2.425113260 0.005761762 -0.005195875 0.004739631 0.001693611

    ## P110h Psp P100m
    ## -0.001676468 -0.015800780 0.005794585
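The selection step described above can be sketched as follows. This is a minimal illustration on simulated data: the data frame `sim`, its three predictors and the coefficients used to generate `IsFrench` are hypothetical, and the `direction` argument of `stepAIC` is left at its default because the report does not state which setting was used.

```r
library(MASS)  # provides stepAIC()

# Simulated stand-in for data_FRAxGER (hypothetical values, not the real data).
set.seed(42)
n <- 400
sim <- data.frame(
  Psp   = rnorm(n, 700, 60),
  Ppv   = rnorm(n, 750, 80),
  P1500 = rnorm(n, 650, 70)
)
# In this toy example the class depends on Psp and Ppv only.
eta <- -0.015 * (sim$Psp - 700) + 0.005 * (sim$Ppv - 750)
sim$IsFrench <- rbinom(n, 1, plogis(eta))

# Start from the logistic model using all available predictors ...
full.model <- glm(IsFrench ~ ., data = sim, family = binomial)
# ... and let stepAIC drop terms as long as doing so lowers the AIC.
step.model <- stepAIC(full.model, trace = FALSE)

coef(step.model)
```

A positive coefficient raises the estimated probability that an entry belongs to a French decathlete as the corresponding score increases, which is how the signs of the coefficients are read in the discussion that follows.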

    The ROC curve corresponding to the model step.model is given in Figure 7. Although not especially
    impressive, the model appears to be able to identify some statistically significant differences between the
    French and German decathletes. Based on the coefficients of step.model, French decathletes appear, to
    a certain extent, to be better than their German counterparts in the discus throw, pole vault, high jump and
    100m (positive coefficients), while German decathletes on average seem to perform better in the 400m,
    110m hurdles and shot put events (negative coefficients).

    With a decision threshold set at 40%, the binary classifier based on the logistic model step.model
    correctly classifies 69.29% of the French decathletes, and 67.27% of the German decathletes, as reported below
    (see also Figure 7). Note that the threshold was set at 40% to obtain balanced percentages of classification
    errors.

    Figure 7: ROC curve for the logistic regression model step.model; the orange dashed lines correspond
    to the sensitivity and specificity of the resulting binary classifier for a decision threshold at 40%.
    [Axes: specificity (horizontal, from 1.0 to 0.0) and sensitivity (vertical).]

    ## [1] Confusion matrix for step.model with 40% decision threshold:

    ## FALSE TRUE class.error
    ## FALSE 522 160 0.3273196
    ## TRUE 254 361 0.3071017
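The percentages quoted in the text can be recovered from this table with a few lines of R. The sketch below assumes that the rows are the predicted classes and the columns the true classes (FALSE for German, TRUE for French); this reading is an assumption, although it is consistent with the column sums, 776 and 521, matching the numbers of German and French entries reported earlier.

```r
# Counts copied from the confusion matrix above.
# Assumption: rows = predicted class, columns = true class
# (FALSE = German, TRUE = French).
conf_mat <- matrix(
  c(522, 254, 160, 361), nrow = 2,
  dimnames = list(predicted = c("FALSE", "TRUE"),
                  actual    = c("FALSE", "TRUE"))
)

colSums(conf_mat)  # 776 German entries, 521 French entries

# Share of French entries classified as French (sensitivity).
sensitivity <- conf_mat["TRUE", "TRUE"] / sum(conf_mat[, "TRUE"])
# Share of German entries classified as German (specificity).
specificity <- conf_mat["FALSE", "FALSE"] / sum(conf_mat[, "FALSE"])

round(100 * c(sensitivity = sensitivity, specificity = specificity), 2)
```

Under this reading, the class.error column is one minus each of these two rates: 1 − 522/776 for the German entries and 1 − 361/521 for the French entries.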

    Note also that, in view of their Z-values, the high jump and the 110m hurdles events appear less influential
    than the other variables. We therefore removed these two predictors from the model; however, the conclusions
    drawn from an analysis of the reduced model are similar to the conclusions based on step.model.

    ## [1] p-values for the Z-values of step.model:

    ## (Intercept) Pdt P400m Ppv Phj
    ## 3.483006e-02 4.743905e-07 3.990390e-05 1.665840e-11 7.215314e-02

    ## P110h Psp P100m
    ## 1.481284e-01 2.258318e-29 3.469760e-05

    5 Conclusion

    In this report, we have used various statistical tools to explore the Decathlon data set. We performed some
    correlation analyses, both parametric and non-parametric, and some tests on means and variances using the
    t-test, F-test and ANOVA, together with various tests for checking conditions. We also performed some
    non-parametric tests for the median of certain quantities, including the Wilcoxon and Kruskal-Wallis rank-sum
    tests, and computed confidence intervals for certain means and variances. In addition, we computed and
    discussed some linear regression and logistic regression models, and used various graphical representation
    tools to illustrate the data and our findings. For the parametric tests, sample sizes were generally large
    enough to ensure the validity of the normal approximation framework.

    Our main observations can be summarised as follows:

    1. the best season performances appear to increase with the years, while the mean and median
    performances appear to decrease over the same period (Section 2);
    2. the year 1988 is relatively “special”, but not in a statistically significant way (Section 2);
    3. the current scoring scheme of the decathlon is not homogeneous (Section 3);
    4. the best decathletes seem, in general, to be those who outperform the others in “pure-speed” events,
    including the long jump (Section 3);
    5. there exist relatively strong partial correlations between groups of events (Section 3);
    6. certain countries seem, to a certain extent, to yield decathletes with specific profiles (Section 4).

    These observations are in our opinion interesting and appear to be statistically significant. Our conclusions
    should nevertheless be checked and refined by further analyses, potentially using other sources of data.

    6 Appendix

    Figure 8: Scatter plots, histograms and correlations for the three throw events.
    [Pairs plot titled “Throw events” for Psp, Pdt and Pjt; the correlations shown in the panels are 0.72, 0.44
    and 0.42.]

