Would you be able to do the assignment attached below? It is a data science assignment in the R language.
MAT022 Foundations of Statistics and Data Science
Summative Assessment 2020/21
Summative assessment for the module is by means of a single report on your statistical analysis
of data related to the National Basketball Association (NBA), a men's professional
basketball league in the USA.
This form of assessment has been chosen because as professional statisticians and data scientists,
you will often be asked to investigate a data set and report on whether it contains anything useful
or interesting. The assessment will also help you to prepare for writing your MSc dissertation
in the summer.
Assessment type   Weight   Max. length   Format               Deadline
Report            100%     10 pages      R Markdown and PDF   Tue 09 Feb 2021
Your report will be assessed according to how well you are able to
• analyse the data set, 40%
• interpret the results of your analysis, and 30%
• present the results of your analysis and your interpretation of the data set. 30%
Your analysis should be performed using the R statistical software package, and your report
prepared using the R Markdown typesetting system and the template provided.
1
You are asked to write a report on data from the 2014–15 season of the National Basketball
Association (NBA), a men's professional basketball league in the USA. The data set is a record
of all shots taken by players in the NBA between October 2014 and March 2015, and consists
of 128,069 observations on 23 variables as described in Table 1.
Please submit your report in PDF format, together with the R Markdown file used to generate
the report, through Learning Central sometime before 12.00 on Tuesday 09 February 2021.
Variable Description
GAME_ID Unique id number of the game.
DATE Date of the game.
HOME_TEAM Team playing at home.
AWAY_TEAM Team playing away from home.
PLAYER_NAME Name of the shooting player.
PLAYER_ID Unique id number of the shooting player.
LOCATION Whether the player was on the home (H) or away (A) team.
W Whether the player's team won (W) or lost (L) the game.
FINAL_MARGIN The margin of victory for the player's team (negative means defeat).
SHOT_NUMBER The number of the shot taken by the shooting player in that game.
PERIOD The period of the game that the shot was taken.
GAME_CLOCK The time remaining in the period when the shot was taken.
SHOT_CLOCK The time remaining on the shot clock.
DRIBBLES Number of dribbles by the player before the shot was taken.
TOUCH_TIME The time that the ball was in the shooting player's hand.
SHOT_DIST The distance of the shooting player from the basket.
PTS_TYPE 2 for shots from inside the arc, 3 for shots from outside the arc.
SHOT_RESULT Whether the shot was successful ("made") or unsuccessful ("missed").
CLOSEST_DEFENDER Name of the nearest defender when the shot was taken.
CLOSEST_DEFENDER_ID Unique id number of the nearest defender.
CLOSE_DEF_DIST Distance of the nearest defender when the shot was taken.
FGM Equal to 1 if the shot was made (scored) otherwise 0.
PTS The number of points scored (0, 2 or 3)
Table 1: Description of the variables in the NBA shot logs data set.
ATL Atlanta Hawks
BKN Brooklyn Nets
BOS Boston Celtics
CHA Charlotte Hornets
CHI Chicago Bulls
CLE Cleveland Cavaliers
DAL Dallas Mavericks
DEN Denver Nuggets
DET Detroit Pistons
GSW Golden State Warriors
HOU Houston Rockets
IND Indiana Pacers
LAC Los Angeles Clippers
LAL Los Angeles Lakers
MEM Memphis Grizzlies
MIA Miami Heat
MIL Milwaukee Bucks
MIN Minnesota Timberwolves
NOP New Orleans Pelicans
NYK New York Knicks
OKC Oklahoma City Thunder
ORL Orlando Magic
PHI Philadelphia 76ers
PHX Phoenix Suns
POR Portland Trail Blazers
SAC Sacramento Kings
SAS San Antonio Spurs
TOR Toronto Raptors
UTA Utah Jazz
WAS Washington Wizards
Table 2: Acronyms for the teams in the NBA.
2
The ability to write clearly and concisely is an important professional competence. To encourage
writing that is brief and to the point, your reports are limited to a maximum of 10 pages. It is
often far more difficult to express yourself in 100 words than in 1000 words, especially when you
have a lot to say, so be careful not to underestimate the challenge posed by this restriction. The
modest page limit will also encourage you to be selective in the results you choose to present.
A suggested structure for your report is shown in Table 3. Note that the title page, abstract,
table of contents, list of references and appendix do not contribute towards the page count.
Title 1 page
Abstract 100 words
Table of contents –
1. Introduction 1/2 page
2. Background 1 page
3. (descriptive analysis) 2 pages
4. (inferential analysis) 2–3 pages
5. (inferential analysis) 2–3 pages
6. (inferential analysis) 2–3 pages
7. Conclusion 1/2 page
References –
Appendices 2 pages max.
Table 3: Suggested report structure
• The title page should contain the title of your report, your name and student number,
and the date on which your report was completed.
• The abstract should contain a short summary of the report and its main conclusions.
• The table of contents should list the number and title of each section against the number
of the page on which the section begins.
• The introduction should consist of a few short paragraphs, describing the purpose of the
report and providing a brief outline of its contents.
• The background section should include a brief review of any relevant literature, and
provide a context for the work presented in the report.
• The report should contain a relatively short section on a descriptive analysis of the data
set, with a title chosen to reflect what the section contains.
• The main part of the report should consist of two or three sections on different inferential
analyses of the data set. Here you should formulate hypotheses, conduct statistical tests,
then present and discuss the results of these tests. The titles of these sections should
reflect what the sections contain.
• The conclusion should consist of a few short paragraphs, providing a summary of the
report and a brief outline of some ideas for future work.
• The report may contain a single appendix for large figures and tables, limited to a
maximum of two pages.
3
Detailed assessment criteria are shown in Table 4.
Level: Distinction (70–100)
Analysis (40%): Hypotheses are interesting and original. Methods are appropriate and applied carefully and precisely. An interesting descriptive analysis is included and reported correctly.
Discussion (30%): Inferences are valid and supported by evidence. Original and interesting conclusions are articulated. There is some shrewd speculation about possible causal factors.
Presentation (30%): A high standard of writing is maintained throughout. The narrative is clear, coherent, eloquent and refined. Figures and tables are used creatively.

Level: Merit (60–69)
Analysis (40%): Hypotheses are formulated correctly. Methods are appropriate and applied correctly. A moderately interesting descriptive analysis is included and reported correctly.
Discussion (30%): Inferences are valid and supported by evidence. Interesting conclusions are articulated. There is some speculation about possible causal factors.
Presentation (30%): A good standard of writing is maintained throughout. The narrative is clear and coherent. Figures and tables are used to illustrate the narrative.

Level: Pass (50–59)
Analysis (40%): Hypotheses are formulated correctly. Methods are applied correctly for the most part. A descriptive analysis is included and reported correctly.
Discussion (30%): Inferences are mostly valid and supported by some evidence. Some relatively interesting conclusions are articulated.
Presentation (30%): An acceptable standard of writing is maintained throughout. The narrative is lacklustre and sometimes unclear. Figures and tables do not always illustrate the narrative.

Level: Fail (0–49)
Analysis (40%): The analysis is bland and almost entirely descriptive.
Discussion (30%): Inferences are invalid or not supported by evidence. There is little of any interest.
Presentation (30%): The report is poorly written. The narrative is disjointed and hard to follow.
Table 4: Assessment criteria
Plagiarism
You may find existing studies of the NBA data set online. Plagiarism is to present other people's
work or ideas as your own, by incorporating it into your work without full acknowledgement.
The need to acknowledge others' work applies not only to text, but also to computer code,
figures, tables etc. You must also attribute text, data, or other resources downloaded from
websites. Following submission your report will be analysed by the Turnitin software, and any
report in which plagiarism is detected will receive a mark of zero.
4
The golden rule when writing is to always think of the reader. For scientific reports, readers
will typically want to read something interesting and learn something in the process.
What do we mean by interesting?
Not interesting The average exam mark of statistics and data science students.
Quite interesting The average mark of male students, the average mark of female students,
and the results of a test of whether any difference is statistically significant.
Very interesting The average mark of male students, the average mark of female students, a
statistical test of whether any difference is significant, and some speculation
about why there is a significant difference, or alternatively why there is not.
Audience. The target audience for your report is this year's cohort of students on the Foundations
of Statistics and Data Science module, so you can assume that your readers are familiar
with the methods and terminology established within the lectures and notebooks. If you choose
to use methods that have not been covered in lectures, you must ensure that any new terms are
properly defined and references to the relevant literature included.
Analysis. The reader should be satisfied that you have performed your analysis correctly,
and in particular that you have verified the conditions that are necessary to apply the various
methods. Your methods should be introduced with a brief summary of their main features, but
technical details should not be discussed at length, although you might consider providing the
interested reader with references to the relevant literature.
Navigation. Do not assume that the reader will read the report from start to finish, as one
might read a novel. Reports should be made easy to navigate using numbered sections and
subsections together with cross-referencing. Once you have written a first draft, it will need
careful editing before it becomes a coherent and polished report. This stage always takes longer
than you think!
Scientific writing. For scientific reports we aim for a style of writing that is clear and concise.
Make sure that sentences are unambiguous and that a good standard of writing is maintained
throughout the report.
• Sections should not start abruptly with the subject matter, but rather with an introductory
sentence or short paragraph. Sections should also end with a concluding sentence or short
paragraph.
• All figures and tables must be numbered and have captions. Figures or tables that are not
mentioned at least once in the text should not be included.
• A qualified statement is one that expresses some level of uncertainty about its own accuracy,
and should always be used when drawing conclusions from the results of a statistical
analysis, and especially when speculating about possible causal factors. Common phrases
that indicate qualified statements include "This suggests that …", "It appears that …",
"We might conclude that …", "There is some evidence to indicate …" and so on.
---
title: "A study of the Decathlon dataset"
author: "A Student"
email: "StudentA@cardiff.ac.uk"
date: "`r format(Sys.time(), '%d %B %Y')`"
fontsize: 11pt
fontfamily: times
geometry: margin=1in
output:
  bookdown::pdf_document2:
    toc: true
    number_sections: true
    keep_tex: true
    citation_package: natbib
    fig_caption: true
    #toc_depth: 1
    highlight: haddock
    df_print: kable
    extra_dependencies:
      caption: ["labelfont={bf}"]
#pdf_document:
#  extra_dependencies: ["flafter"]
#pdf_document2:
#  extra_dependencies: ["float"]
bibliography: [refs.bib]
biblio-style: apalike
link-citations: yes
abstract: We demonstrate how various descriptive and inferential statistical methods can be applied to the `Decathlon` dataset, and how their results might be interpreted and presented. We study the evolution of athlete performance as a function of time, and show that while the best performances appear to increase with time, the mean and median performances appear to decrease over the same period. We illustrate the non-homogeneity of the current decathlon scoring scheme, and give some insight into the profile of the best-performing decathletes. We also perform a correlation analysis to explore relationships between the different events of the decathlon, and finally present the results of a logistic-regression analysis and demonstrate that to a certain extent it is possible to distinguish between French and German decathletes by the scores they achieve on a subset of the decathlon events.
---
```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE)
```
```{r, include=FALSE}
library(corrplot)
library(Hmisc)
library(car)
library(ppcor)
library(ggpubr)
library(MASS)
library(pROC)
# read data
Decathlon <- read.csv('decathlon.csv', header=TRUE)
Nobs <- nrow(Decathlon) # number of entries
```
# Introduction {#sec:intro}
The `Decathlon` data set records the performances of elite decathletes in international competitions over the period from 1985 to 2006. The decathlon is a combined event in athletics consisting of ten track-and-field events; the current decathlon world-record holder is the French decathlete Kevin Mayer, who achieved a score of $9{,}126$ points in 2018. One may refer to the following Wikipedia page for more information on the decathlon: [https://en.wikipedia.org/wiki/Decathlon](https://en.wikipedia.org/wiki/Decathlon).
The data set consists of $7{,}968$ observations of $24$ variables. The names of the variables are listed below:
```{r, echo = FALSE}
noquote('Variable names:')
print(names(Decathlon))
```
An entry of the data set consists of the total number of points scored by a decathlete, the name and nationality of the decathlete, and the year the performance was achieved. The raw performances for each of the 10 events are reported (with time in seconds and distance or height in meters), together with the number of points scored for these events. There are $2{,}709$ different decathletes of $107$ different nationalities in the data set.
**Remark**. For $435$ entries of the data set, we observed a difference of $1$ between the variable `Totalpoints`, corresponding to the total number of points scored by a decathlete for that performance, and the sum of the points scored during the $10$ events. We decided to apply a correction to the corresponding entries of the variable `Totalpoints`, so that they are all equal to the sum of the points scored during the $10$ events.
```{r}
# Correction of `Totalpoints`
Decathlon$Totalpoints <- rowSums(Decathlon[,15:24])
```
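Before overwriting `Totalpoints`, it may be worth counting how many entries disagree with the sum of the event points. The following is a minimal sketch on a made-up data frame: the column names `P100m`, `Plj` and `Psp` and all values are illustrative assumptions, not taken from the data set.

```r
# Toy data: three performances, three events (hypothetical values)
toy <- data.frame(
  Totalpoints = c(2400, 2501, 2300),
  P100m       = c(800, 850, 760),
  Plj         = c(790, 830, 770),
  Psp         = c(810, 820, 770)
)

event_cols <- c("P100m", "Plj", "Psp")
event_sum  <- rowSums(toy[, event_cols])

# Difference between the recorded total and the recomputed total
discrepancy <- toy$Totalpoints - event_sum
n_mismatch  <- sum(discrepancy != 0)
print(table(discrepancy))

# Apply the same kind of correction as in the chunk above
toy$Totalpoints <- event_sum
```

Tabulating the discrepancies first makes it possible to report their size and frequency before the correction is applied.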
In our study we pay particular attention to the total number of points, the points scored during each event, the nationality of the decathletes, and the year the performances were achieved. The Pearson correlations between the total number of points and the points scored during the various events are illustrated by the correlogram shown in Figure \@ref(fig:correlo-pts-evt), while scatter plots and histograms showing the points scored during the three throw events (shot-put, discus and javelin throw) are presented in Figure \@ref(fig:pairs-throw-evt) (see appendix).
```{r correlo-pts-evt, echo = FALSE, fig.height = 4, fig.width = 4, fig.cap = "Graphical representation of the Pearson correlations between the total number of points and the points scored during each event.", fig.align = "center"}
CorrMat <- cor(Decathlon[c(1,15:24)])
corrplot(CorrMat, method="circle")
```
# Performance across years {#sec:perf-year}
In this section, we investigate the evolution of the overall performance (variable `Totalpoints`) as a function of the year these performances were achieved (variable `yearEvent`). The number of observations for each year (or season) varies between $321$ and $399$. A graphical representation of the evolution of the best, mean and median performance as a function of the year is shown in Figure \@ref(fig:Evo-Perf-Year).
```{r Evo-Perf-Year, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Evolution of the best, mean and median performance as a function of the year. In each case, the regression line (dashed black line) is shown together with the corresponding $95\\%$ confidence intervals (purple dashed curves; prediction without noise term).", fig.align = "center"}
Ind_Best_Year <- rep(0,22)
Mean_Perf_Year <- rep(0,22)
Median_Perf_Year <- rep(0,22)
for (i in 0:21){
Ind_Best_Year[i+1] <- which.max(Decathlon$Totalpoints * as.numeric(Decathlon$yearEvent == (1985+i)))
Mean_Perf_Year[i+1] <- mean(Decathlon$Totalpoints[Decathlon$yearEvent == (1985+i)])
Median_Perf_Year[i+1] <- median(Decathlon$Totalpoints[Decathlon$yearEvent == (1985+i)])
}
Top_Perf_Year <- Decathlon$Totalpoints[Ind_Best_Year]
Perf_Year <- data.frame(Year=1985:2006, Top=Top_Perf_Year, Mean=Mean_Perf_Year, Median=Median_Perf_Year)
mod1 <- lm(Top ~ Year, data=Perf_Year)
mod2 <- lm(Mean ~ Year, data=Perf_Year)
mod3 <- lm(Median ~ Year, data=Perf_Year)
#summary(mod3)
par(mfrow=c(1,2))
xcoord <-seq(1985, 2006,length.out=101)
newData <- data.frame(Year=xcoord)
#pred.w.plim <- predict(mod1, newData,
# interval = "prediction", level=0.95)
pred.w.clim <- predict(mod1, newData,
interval = "confidence", level=0.95)
matplot(xcoord,
#cbind(pred.w.clim, pred.w.plim[,-1]),
cbind(pred.w.clim),
lty = c(2,2,2), lwd=3,
#col = c('black', 'green3', 'green3', 'purple', 'purple'),
col = c('black', 'purple', 'purple'),
type = "l",
xlab = 'Year',
ylab = 'Total points',
ylim = c(min(Top_Perf_Year),max(Top_Perf_Year)),
#main='prediction for Male'
)
lines(1985:2006, Top_Perf_Year, type='o',
pch=19, col='red', cex=1, lwd=2)
legend('topleft', c('Best'), col=c('red'), lty=1, lwd=2,
pch=19, pt.cex=1)
#################################
#pred.w.plim <- predict(mod2, newData,
# interval = "prediction", level=0.95)
pred.w.clim <- predict(mod2, newData,
interval = "confidence", level=0.95)
matplot(xcoord,
#cbind(pred.w.clim, pred.w.plim[,-1]),
cbind(pred.w.clim),
lty = c(2,2,2), lwd=3,
#col = c('black', 'green3', 'green3', 'purple', 'purple'),
col = c('black', 'purple', 'purple'),
type = "l",
xlab = 'Year',
ylab = 'Total points',
ylim = c(min(Median_Perf_Year), max(Mean_Perf_Year)),
#main='prediction for Male'
)
lines(1985:2006, Mean_Perf_Year, type='o',
pch=19, col='chartreuse3', cex=1, lwd=2)
#pred.w.plim <- predict(mod3, newData,
# interval = "prediction", level=0.95)
pred.w.clim <- predict(mod3, newData,
interval = "confidence", level=0.95)
matlines(xcoord,
#cbind(pred.w.clim, pred.w.plim[,-1]),
cbind(pred.w.clim),
lty = c(2,2,2), lwd=3,
#col = c('black', 'green3', 'green3', 'purple', 'purple'),
col = c('black', 'purple', 'purple'),
type = "l",
xlab = 'Year',
ylab = 'Total points',
ylim = c(min(Median_Perf_Year),max(Mean_Perf_Year)),
#main='prediction for Male'
)
lines(1985:2006, Median_Perf_Year, type='o',
pch=19, col='orange', cex=1, lwd=2)
legend('topright', c('Mean', 'Median'), col=c('chartreuse3', 'orange'), lty=1, lwd=2,
pch=19, pt.cex=1)
```
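The per-season summaries computed by the loop in the chunk above could also be obtained with `tapply`. This is only a sketch on a small made-up data frame (the values are hypothetical); it is not the code used to produce the figure.

```r
# Toy data (hypothetical values): two seasons, three performances each
toy <- data.frame(
  yearEvent   = c(1985, 1985, 1985, 1986, 1986, 1986),
  Totalpoints = c(7000, 7500, 8000, 7100, 7200, 8300)
)

# Best, mean and median total score per season;
# tapply returns the groups in increasing order of the year
perf_year <- data.frame(
  Year   = sort(unique(toy$yearEvent)),
  Top    = as.numeric(tapply(toy$Totalpoints, toy$yearEvent, max)),
  Mean   = as.numeric(tapply(toy$Totalpoints, toy$yearEvent, mean)),
  Median = as.numeric(tapply(toy$Totalpoints, toy$yearEvent, median))
)
print(perf_year)
```

The resulting data frame has the same layout as `Perf_Year`, so it could be passed to `lm` in the same way.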
## Evolution of performances {#sec:evo-perf}
Figure \@ref(fig:Evo-Perf-Year) suggests that the best season performance increases with time, while the mean and median season performances appear to decrease with time. This observation is supported by tests of *Spearman's rank correlation* between the year and the best, mean and median season performances. The results are shown in Table \@ref(tab:Evo-Perf-Corr-Test), where we observe
* for the best season performance a positive Spearman correlation of $0.398$, statistically significant at the significance level $\alpha=0.05$;
* for the mean season performance a negative Spearman correlation of $-0.551$, statistically significant at the significance level $\alpha=0.01$;
* for the median season performance a negative Spearman correlation of $-0.581$, statistically significant at the significance level $\alpha=0.01$.
```{r Evo-Perf-Corr-Test, echo=FALSE}
CorrT1 <- cor.test(1985:2006, Top_Perf_Year, method='spearman', alternative='greater')
CorrT2 <- cor.test(1985:2006, Mean_Perf_Year, method='spearman', alternative='less')
CorrT3 <- cor.test(1985:2006, Median_Perf_Year, method='spearman', alternative='less', exact=FALSE)
Evo_Perf_Test_res <- data.frame(performance=c('Best', 'Mean', 'Median'),
alternative=c(CorrT1$alternative,CorrT2$alternative,CorrT3$alternative),
p.value=c(CorrT1$p.value,CorrT2$p.value,CorrT3$p.value),
estimate=c(CorrT1$estimate,CorrT2$estimate,CorrT3$estimate))
knitr::kable(
Evo_Perf_Test_res,
caption = 'Result of the Spearman\'s rank correlation tests between the year and the season\'s best, mean and median performances.',
align = 'cccc',
booktabs = TRUE)%>%kable_styling(latex_options = "HOLD_position")
```
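To illustrate how a one-sided Spearman test of this kind behaves, here is a small sketch on a made-up, mostly increasing series. The vectors `year` and `score` are hypothetical and unrelated to the `Decathlon` data.

```r
year  <- 1:10
score <- c(3, 5, 4, 8, 9, 12, 11, 15, 18, 20)  # hypothetical values, no ties

# H1: the rank correlation between year and score is positive
res <- cor.test(year, score, method = "spearman", alternative = "greater")

rho <- unname(res$estimate)   # Spearman's rank correlation coefficient
p   <- res$p.value
print(c(rho = rho, p.value = p))
```

With no ties, the coefficient equals $1 - 6\sum d_i^2 / (n(n^2-1))$, where $d_i$ is the difference between the two ranks of observation $i$; here $\sum d_i^2 = 4$ and $n = 10$.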
## Comparison of mean season performances {#sec:comp-season-perf}
Figure \@ref(fig:CI-Var-Mean-Year) gives an overview of the sample mean and sample variance of the performances achieved each year as represented by the `Totalpoints` variable; the corresponding confidence intervals (at the confidence level $70\%$) are also presented. We observe that 1988 has the largest sample mean and the second-largest sample variance among all years, while 1996 has the second-largest sample mean and the largest sample variance. Interestingly, 1988 and 1996 were both Olympic years, with the 1988 Games now infamous for many proven doping cases. We also observe a relatively strong overlap between the confidence intervals corresponding to these two years. The number of observations for each year is relatively large (at least $321$ observations), so by the central limit theorem the normal approximations used to compute these intervals are likely to be reasonably good.
```{r CI-Var-Mean-Year, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Sample mean and sample variance for the performances achieved each season, and corresponding confidence intervals at the confidence level $70\\%$. The horizontal intervals correspond to the means (in orange), and the vertical ones to the variances (in green).", fig.align = "center"}
VARxCIxComp<-function(Obs, alpha){
Nsamp<-length(Obs)
sss2<- var(Obs)
LowBound<- (Nsamp-1) * sss2 / qchisq(1-alpha/2, df=Nsamp-1)
UpBound <- (Nsamp-1) * sss2 / qchisq(alpha/2, df=Nsamp-1)
return(c(LowBound, UpBound))
}
alpha <- 0.3
Mean_TotPts_Year <- rep(0,22)
Var_TotPts_Year <- rep(0,22)
CI_mean_TotPts_Year <- matrix(0, 22, 2)
CI_var_TotPts_Year <- matrix(0, 22, 2)
Rec_numb_obs <- rep(0,22)
for (i in 0:21){
obs <- Decathlon$Totalpoints[Decathlon$yearEvent == (1985 + i)]
CI_mean <- t.test(obs, conf.level=1-alpha)$conf.int
CI_mean_TotPts_Year[i+1,] <- CI_mean
Mean_TotPts_Year[i+1] <- mean(obs)
CI_var_TotPts_Year[i+1,] <- VARxCIxComp(obs, alpha)
Var_TotPts_Year[i+1] <- var(obs)
}
plot(c(),c(),
type='p', pch=3, cex=0.5, lwd=2, col='blue',
xlab='Mean', ylab='Variance',
xlim=c(7275, 7460),
ylim=c(130000,215000))
for (i in 1:22){
lines(CI_mean_TotPts_Year[i,], c(1,1)*Var_TotPts_Year[i], lty=1, lwd=3, col='orange')
lines(c(1,1)*Mean_TotPts_Year[i], CI_var_TotPts_Year[i,], lty=1, lwd=3, col='chartreuse3')
}
for (i in 1:22){
text(Mean_TotPts_Year[i]+3, Var_TotPts_Year[i],
sprintf('%02d',(1984+i) %% 100),
pos=3, cex = 1.1)
}
points(Mean_TotPts_Year, Var_TotPts_Year,
type='p', pch=3, cex=0.5, lwd=2, col='blue')
```
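The variance intervals in the chunk above follow the usual chi-squared construction, $\left[(n-1)s^2/\chi^2_{1-\alpha/2,\,n-1},\ (n-1)s^2/\chi^2_{\alpha/2,\,n-1}\right]$, which is derived under the assumption of normally distributed observations. The sketch below restates it as a small function and applies it to simulated scores; the function name `var_ci` and the simulated sample are illustrative assumptions.

```r
# Chi-squared confidence interval for a variance
# (derived under the assumption of normally distributed data)
var_ci <- function(obs, alpha) {
  n  <- length(obs)
  s2 <- var(obs)
  c(lower = (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1),
    upper = (n - 1) * s2 / qchisq(alpha / 2, df = n - 1))
}

set.seed(1)                                  # simulated scores, illustrative only
obs <- rnorm(350, mean = 7350, sd = 400)
ci  <- var_ci(obs, alpha = 0.3)              # 70% interval, as in the figure
print(ci)
```

Because the interval is not symmetric around the sample variance, the vertical bars in the figure extend further above each point than below it.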
Three different tests for homogeneity of variances return p-values between $0.052$ and $0.102$, as shown in Table \@ref(tab:TotPts-Year-Homosced). These results indicate that there is no strong statistical evidence to suggest that the variance of athlete performances differs across years: for each of the three tests we do not reject the null hypothesis of homoscedasticity at significance level $\alpha=0.05$.
```{r TotPts-Year-Homosced, echo=FALSE}
Decathlon$yearEvent_AsFactor <- factor(Decathlon$yearEvent)
Thmscd1 <- bartlett.test(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Thmscd2 <- fligner.test(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Thmscd3 <- leveneTest(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Test_Tot_Year_Homosced <- data.frame(Test=c('Bartlett', 'Fligner-Killeen', 'Levene'),
p.value=c(Thmscd1$p.value, Thmscd2$p.value, Thmscd3[1,3]))
knitr::kable(
Test_Tot_Year_Homosced,
caption = 'Tests for homogeneity of variance for the performances (i.e. scores achieved by the decathletes) achieved each year.',
align = 'cc',
booktabs = TRUE)%>%kable_styling(latex_options = "HOLD_position")
```
Next we apply a one-way ANOVA to see whether there is a significant difference between the mean performances achieved each year, and find that it returns a p-value smaller than $0.0002$ (see below), which suggests that there is a statistically significant difference between the mean performances for at least two years. Notice that the number of observations for each year is roughly the same, so the data are approximately balanced. Tests indicate that the data are not normally distributed; however, the number of observations in each year group is relatively large (at least $321$), so by the central limit theorem the conclusion of this ANOVA is likely to be valid. We can also report that an ANOVA performed with the `oneway.test` function (which does not assume equality of variances) returns a p-value of similar magnitude.
```{r}
AOV1 <- aov(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
print(summary(AOV1))
```
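The comparison with `oneway.test` mentioned above might look as follows. This is a sketch on simulated scores for three hypothetical seasons, so the p-values it prints say nothing about the `Decathlon` data.

```r
set.seed(42)  # simulated scores for three hypothetical seasons
toy <- data.frame(
  year  = factor(rep(c("1985", "1986", "1987"), each = 50)),
  score = c(rnorm(50, 7300, 350), rnorm(50, 7350, 450), rnorm(50, 7600, 400))
)

# Classical one-way ANOVA (assumes equal variances across groups)
p_aov <- summary(aov(score ~ year, data = toy))[[1]][["Pr(>F)"]][1]

# Welch's one-way test (does not assume equal variances)
p_welch <- oneway.test(score ~ year, data = toy, var.equal = FALSE)$p.value

print(c(aov = p_aov, welch = p_welch))
```

When the group sizes are similar, as they are here, the two p-values are usually of the same order of magnitude.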
## Comparison between 1988 and 1996 {#sec:1988-1996}
In Figure \@ref(fig:Evo-Perf-Year) we observe that the sample mean of the overall performances for 1988 is the largest among all years. (A similar observation holds for the sample median, while 1988 interestingly also has the lowest season's best performance.) We now test to see whether there is a statistically significant difference between the mean performances for $1988$ and $1996$, the year having the second-largest sample mean of the overall performance.
```{r, echo=FALSE}
PvalFt <- (var.test(Decathlon$Totalpoints[Decathlon$yearEvent == 1988],
Decathlon$Totalpoints[Decathlon$yearEvent == 1996],
alternative = "two.sided", conf.level=0.95))$p.value
PvalTt <- (t.test(Decathlon$Totalpoints[Decathlon$yearEvent == 1988],
Decathlon$Totalpoints[Decathlon$yearEvent == 1996],
alternative = "two.sided", conf.level=0.95))$p.value
noquote( sprintf('F-test (comparison of variances); p-value=%.4f', PvalFt) )
noquote( sprintf('t-test (comparison of means); p-value=%.4f', PvalTt) )
```
The $F$-test for equality of variances indicates that there is no statistical evidence to suggest a difference between the variances of the two samples. The two-sample $t$-test then indicates that there is no statistical evidence to suggest a difference between the means of the two samples. Note that although the two samples might not be normally distributed, the relatively large sample sizes suggest that the normal approximation should work well here.
# Differences between events {#sec:diff-evts}
Figure \@ref(fig:Diff-Pts-Evt) illustrates the sample mean, sample median and sample variance of the number of points scored in each of the $10$ decathlon events, where we observe that the scheme for awarding points does not appear to be homogeneous. For example, decathletes on average seem to score more points for the 110m hurdles than for the javelin event, and the variance of the points scored for the pole vault appears to be considerably larger than the variance of the points scored for the 100m.
```{r Diff-Pts-Evt, echo = FALSE, fig.height = 3.5, fig.width = 9, fig.cap = "Sample mean, median and variance of the number of points scored during each event. Values for the mean and the median are read on the left-hand-side axis, and values for the variance on the right-hand-side axis.", fig.align = "center"}
EvtNames <- colnames(Decathlon)[15:24]
VarEvt <- apply(Decathlon[,15:24], 2, var)
MedianEvt <- apply(Decathlon[,15:24], 2, median)
MeanEvt <- apply(Decathlon[,15:24], 2, mean)
par(mar=c(3,3,1,3))
CEX1 <- 2
CEX2 <- 1.8
CEX3 <- 1.2
plot(1:10, MedianEvt,
type='o', pch=17, lwd=3, cex=CEX2, col='orange',
xlab='', ylab='',
xaxt='n')
lines(1:10, MeanEvt,
type='o', lty=2, pch=18, lwd=3, cex=CEX1, col='chartreuse3')
axis(1, at=seq(1,10,by=1), labels=EvtNames, las=0)
legend('topright', c('Mean', 'Median', 'Variance'),
col=c('chartreuse3', 'orange', 'firebrick2'),
lty=c(2,1,1), lwd=3,
pch=c(18,17,19), pt.cex=c(CEX1,CEX2, CEX3),
y.intersp=1.3)
# Allow a second plot on the same graph
par(new=TRUE, xpd=FALSE)
colNsupp<-'blue'
plot(1:10, VarEvt,
type='o', pch=19, lwd=3, cex=CEX3, col='firebrick2',
xlab='', ylab='',
xaxt='n', yaxt='n')
axis(4, at=seq(4000,10000,by=2000),
labels=seq(4000,10000,by=2000),
col='black', col.axis='black' ,las=0)
```
## Difference between the median number of points scored during each event {#sec:diff-med-evt}
To test the statistical significance of the apparent non-homogeneity of the decathlon scoring scheme suggested by Figure \@ref(fig:Diff-Pts-Evt), we perform a Kruskal-Wallis rank sum test to see whether there is a significant difference between the median number of points scored during each event (see below). The obtained p-value is extremely small, so there is strong statistical evidence to indicate that the median number of points differs for at least two events.
```{r, echo=FALSE}
PtsEvtNames <- colnames(Decathlon)[15:24]
Event <- c()
Pts_Event <- c()
for (i in 1:10){
Event <- c(Event, rep(PtsEvtNames[i], Nobs))
Pts_Event <- c(Pts_Event, unname(Decathlon[,14+i]))
}
Event <- factor(Event)
kruskal.test(Pts_Event ~ Event)
```
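The long-format vectors built by the loop in the chunk above could also be produced with the base R function `stack`. The sketch below uses a made-up data frame with three event columns; the names `P100m`, `Plj` and `Pjt` and all values are illustrative assumptions.

```r
# Toy data: points scored by four athletes in three events (hypothetical values)
toy <- data.frame(
  P100m = c(800, 850, 760, 900),
  Plj   = c(790, 830, 770, 880),
  Pjt   = c(600, 640, 580, 700)
)

# stack() returns one row per (athlete, event) pair,
# with the score in `values` and the source column name in `ind`
long <- stack(toy)
names(long) <- c("Pts_Event", "Event")

kw <- kruskal.test(Pts_Event ~ Event, data = long)
print(kw)
```

The grouping factor produced by `stack` keeps the original column order as its level order, which can be convenient when plotting the events in their competition order.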
Next we perform a Wilcoxon signed-rank test (with continuity correction) to test whether the number of points scored for the pole-vault event is larger than for the javelin-throw event. We use the test for *paired* observations, because the same decathlete achieved both scores. The results of the test are summarized below.
```{r, echo=FALSE}
#noquote(colnames(Decathlon)[c(20,23)])
noquote('Difference between the medians of Ppv and Pjt:')
Wil_Pv_Jt <- wilcox.test(Decathlon[,20], Decathlon[,23],
alternative='greater', paired=TRUE,
exact=FALSE, correct=TRUE)
noquote(sprintf('wilcox.test p-value=%.4f', Wil_Pv_Jt$p.value))
```
The obtained p-value is extremely small, so there is strong statistical evidence to suggest that the median number of points scored during the pole-vault event is larger than for the javelin-throw event. To a lesser extent, we also observe a statistically significant difference between the median number of points scored for the 100m and long-jump events, even though these two look relatively close in Figure \@ref(fig:Diff-Pts-Evt):
```{r, echo=FALSE}
#noquote(colnames(Decathlon)[c(15,16)])
noquote('Difference between the medians of P100m and Plj:')
Wil_100_Lj <- wilcox.test(Decathlon[,15], Decathlon[,16],
alternative='two.sided', paired=TRUE,
exact=FALSE, correct=TRUE)
noquote(sprintf('wilcox.test p-value=%.4f', Wil_100_Lj$p.value))
```
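As a small illustration of the paired test, the sketch below applies `wilcox.test` with `paired = TRUE` to made-up scores for eight athletes. The vectors `pole_vault` and `javelin` are hypothetical values, not taken from the data set.

```r
# Hypothetical points scored by the same eight athletes in two events
pole_vault <- c(820, 790, 860, 905, 760, 880, 835, 798)
javelin    <- c(700, 715, 742, 778, 690, 801, 765, 711)

# With paired = TRUE the test works on the within-athlete differences
res <- wilcox.test(pole_vault, javelin,
                   alternative = "greater", paired = TRUE,
                   exact = FALSE, correct = TRUE)

V <- unname(res$statistic)   # sum of the ranks of the positive differences
print(res)
```

Every difference is positive in this toy example, so the statistic takes its maximum value $n(n+1)/2 = 36$ for $n = 8$.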
## Profile of the season best performers {#sec:profil-best}
To identify which events appears to be the more decisive in determining the overall winner, for each year we compute the in-season ranking for each event of the decathlete with the best overall performance, illustrated in Figure \@ref(fig:in-season-rank-top).
```{r in-season-rank-top, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Boxplot of the in-season rank achieved by the season best performers for each event of the decathlon.", fig.align = "center"}
# For each season, rank all performances within each event (rank 1 = most points)
# and keep the ranks of the first row of the season, which is taken to be the
# season's best overall performance (rows assumed sorted by decreasing Totalpoints).
Year_Perf_rank <- matrix(0,22,10)
for (i in 0:21){
  Season_Pts <- Decathlon[(Decathlon$yearEvent == (1985+i)),15:24]
  Year_Perf_rank[(i+1),] <- (apply(-Season_Pts, 2, rank))[1,]
}
colnames(Year_Perf_rank) <- colnames(Decathlon)[15:24]
rownames(Year_Perf_rank) <- 1985:2006
options(repr.plot.width=10, repr.plot.height=7)
boxplot(Year_Perf_rank, col='mediumorchid3', ylab='In-season ranking')
```
Although only descriptive, the rank-based analysis associated with Figure \@ref(fig:in-season-rank-top) seems to highlight some interesting features:
* The least decisive event appears to be the 1500m (in view of Figure \@ref(fig:in-season-rank-top)). In our opinion, this can be at least partially explained as follows. The 1500m is the last of the $10$ events, occurring at the end of the second day, so only the decathletes in a close fight for victory or trying to beat their personal record have an interest in performing well during this event (and they are certainly all tired after two days of competition). The 1500m is also the only pure endurance event (the only other event with an endurance component is the 400m), so there is no real interest for a decathlete to specialise in it. This observation is in agreement with the fact that the 1500m appears to be the event that is the least correlated with the others and with the total score achieved by a decathlete; see Figure \@ref(fig:correlo-pts-evt).
* The most decisive event appears to be the 110m hurdles; in the data, the best season performer achieved the best or second-best season performance at the 110m hurdles in $50\%$ of the cases. This might be explained by the fact that the 110m hurdles requires very good explosiveness and speed, combined with strong technique and excellent coordination; the risk of falling during the race is also very high in comparison to the other track events.
* The second, third and fourth most decisive events appear to be the 100m, the long jump and the 400m, respectively. Although the long jump is a jump event, the qualities needed to perform well at it are very similar to those required to be a good sprinter (the correlation between the scores achieved during the long jump and the 100m is actually relatively strong, at $0.49$).
These observations suggest that top-performing decathletes excel in the "speed-related" events, and perform relatively well in the other events except for the 1500m.
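The in-season ranking described in this section can be sketched on a small simulated example. The data frame `toy` and its two event columns are hypothetical; unlike the chunk above, this sketch locates the season's best performer explicitly with `which.max` rather than relying on the row order of the data.

```r
set.seed(3)
toy <- data.frame(
  yearEvent = rep(c(1985, 1986), each = 5),
  P100m     = round(rnorm(10, mean = 800, sd = 40)),
  P1500     = round(rnorm(10, mean = 650, sd = 60))
)
toy$Totalpoints <- toy$P100m + toy$P1500
events <- c("P100m", "P1500")

# For each season: rank performances within each event (rank 1 = most points),
# then keep the ranks of the performance with the highest total
best_ranks <- t(sapply(split(toy, toy$yearEvent), function(season) {
  ranks <- apply(-as.matrix(season[, events]), 2, rank)
  ranks[which.max(season$Totalpoints), ]
}))
best_ranks
```

Each row of `best_ranks` corresponds to one season and each column to one event, which is the layout passed to `boxplot` to produce a figure of the kind shown above.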
## Partial correlation between events {#sec:partial-corr}
To further investigate the relationships between the different events of the decathlon, we perform a partial correlation analysis by computing the partial correlation between the points scored for every pair of events while controlling for all the other events. The computed partial correlations are illustrated in Figure \@ref(fig:partial-correlo-pts-evt).
```{r partial-correlo-pts-evt, echo = FALSE, fig.height = 4, fig.width = 4, fig.cap = "Graphical representation of the partial correlations between the points scored during every pair of events while controlling for all the other events.", fig.align = "center" }
AllPcor <- pcor(as.matrix(Decathlon[,c(15:24)]) )$estimate
corrplot(AllPcor, method="circle")
```
Interestingly, and in comparison to the correlations shown in Figure \@ref(fig:correlo-pts-evt), if we control for all the other events, we observe only a few relatively strong partial correlations between pairs of events (all of which are significant at $\alpha=0.01$). In particular we observe that:
* `P100m` and `P1500` are negatively correlated,
* `P400m` and `P1500` are positively correlated,
* `P400m` and `P100m` are positively correlated,
* `Psp` and `Pdt` are positively correlated.
The positive partial correlations for the pairs `P400m-P100m` and `P400m-P1500`, and the negative partial correlation for the pair `P1500-P100m`, might be a consequence of the fact that decathletes performing well at the 400m may either be very fast or have good endurance. The relatively strong partial correlation between `Psp` and `Pdt` is also not surprising, because the shot put and discus throw require very similar physical abilities and technical skills.
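For readers unfamiliar with partial correlations, the sketch below shows one standard way of computing them in base R, from the inverse of the correlation matrix, and checks the result against the correlation of regression residuals. The variables `x`, `y` and `z` are simulated for illustration only; the analysis above uses the `pcor` function instead.

```r
set.seed(4)
n <- 500
z <- rnorm(n)                      # common factor driving both x and y
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)
X <- cbind(x = x, y = y, z = z)

# Partial correlations from the precision matrix P = R^{-1}:
# pcor_ij = -P_ij / sqrt(P_ii * P_jj)
P  <- solve(cor(X))
pc <- -P / sqrt(outer(diag(P), diag(P)))
diag(pc) <- 1

# Cross-check: correlation of the residuals of x and y after regressing on z
r_xy_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(marginal = cor(x, y), partial = pc["x", "y"], residual_check = r_xy_z)
```

In this simulation `x` and `y` are strongly correlated only because both depend on `z`; once `z` is controlled for, their partial correlation is close to zero. The same mechanism explains why few strong partial correlations remain between events after controlling for all the others.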
# Differences between French and German decathletes {#sec:french-german}
In this section, we explore the possibility of differentiating between French and German decathletes by using a logistic regression model based on the points scored during a selection of events. To this end, we extract the entries of the data set corresponding to the points scored during the 10 events by the French and German decathletes, and gather these entries in a data set named `data_FRAxGER`; a categorical variable `IsFrench` is added to this data set, taking the value $1$ when the entry corresponds to a French decathlete, and $0$ for a German decathlete. The resulting data set has $1{,}297$ entries, with $40.17\%$ corresponding to French decathletes. Notice that no decathlete appears more than $12$ times in the extracted data set.
```{r, echo=FALSE}
Ind_FRA <- (1:Nobs)[Decathlon$Nationality == 'FRA']
Ind_GER <- (1:Nobs)[Decathlon$Nationality == 'GER']
Ind_FRAxGER <- c(Ind_FRA,Ind_GER)
n_FRA <- length(Ind_FRA)
n_GER <- length(Ind_GER)
n_FRAxGER <- n_FRA + n_GER
noquote(sprintf('Number of French entries: %d', n_FRA))
noquote(sprintf('Number of German entries: %d', n_GER))
IsFrench <- rep(0,n_FRAxGER)
IsFrench[1:n_FRA] <- 1
IsFrench <- factor(IsFrench)
Diff_FRA <- length(unique(Decathlon$DecathleteName[Ind_FRA]))
Diff_GER <- length(unique(Decathlon$DecathleteName[Ind_GER]))
noquote(sprintf('Number of different French decathletes: %d', Diff_FRA))
noquote(sprintf('Number of different German decathletes: %d', Diff_GER))
data_FRAxGER <- Decathlon[Ind_FRAxGER, 15:24]
data_FRAxGER$IsFrench <- IsFrench
```
To select a set of relevant events, we use the `stepAIC` function of the `MASS` package. The stepwise procedure is initialised from the logistic model based on the points scored during all 10 events. The resulting model, named `step.model`, depends on $7$ events; its coefficients are given below.
```{r, echo=FALSE}
# Full logistic model on the points of the 10 events, then AIC-based stepwise selection
full.model <- glm(IsFrench ~ Plj + Pdt + P400m + Ppv + Pjt + Phj + P110h + P1500 + Psp + P100m,
                  family=binomial(link='logit'), data=data_FRAxGER)
step.model <- full.model %>% stepAIC(trace = FALSE)
#summary(step.model)
noquote('Coefficients of step.model:')
print(coef(step.model)[1:5])
print(coef(step.model)[6:8])
```
The ROC curve corresponding to the model `step.model` is given in Figure \@ref(fig:ROC-curve). Although not especially impressive, the model appears to be able to identify some statistically significant differences between the French and German decathletes. Based on the coefficients of `step.model`, French decathletes appear, to a certain extent, to be better than their German counterparts in the discus throw, pole vault, high jump and 100m (positive coefficients), while German decathletes on average seem to perform better in the 400m, 110m hurdles and shot put (negative coefficients).
```{r ROC-curve, echo = FALSE, fig.height = 3, fig.width = 3, fig.cap = "ROC curve for the logistic regression model \\texttt{step.model}; the orange dashed lines correspond to the sensitivity and specificity of the resulting binary classifier for a decision threshold at $40\\%$.", fig.align = "center"}
confusion.glm <- function(data, model, tresh){
  # Cross-tabulate predicted classes (rows) against observed classes (columns)
  prediction <- ifelse(predict(model, data, type='response') > tresh, TRUE, FALSE)
  confusion <- table(prediction, as.logical(model$y))
  # Error rate within each observed class (FALSE first, then TRUE)
  confusion <- cbind(confusion, c(1 - confusion[1,1]/(confusion[1,1]+confusion[2,1]),
                                  1 - confusion[2,2]/(confusion[2,2]+confusion[1,2])))
  confusion <- as.data.frame(confusion)
  names(confusion) <- c('FALSE', 'TRUE', 'class.error')
  confusion
}
prob <- predict(step.model, type='response')
data_FRAxGER$prob <- prob
ROC_curve <- roc(IsFrench ~ prob, data=data_FRAxGER,
                 levels=c(0,1), direction='<')
options(repr.plot.width=7, repr.plot.height=7)
plot(ROC_curve)
# Confusion matrix at a given decision threshold
Confu_T <- confusion.glm(data_FRAxGER, step.model, 0.4)
SensitivityT <- Confu_T[2,2]/sum(Confu_T[,2])
SpecificityT <- Confu_T[1,1]/sum(Confu_T[,1])
abline(v=SpecificityT, col='orange', lwd=2, lty=2)
abline(h=SensitivityT , col='orange', lwd=2, lty=2)
```
With a decision threshold set at $40\%$, the binary classifier based on the logistic model `step.model` correctly classifies $69.29\%$ of the French decathletes and $67.27\%$ of the German decathletes, as reported below (see also Figure \@ref(fig:ROC-curve)). Note that the threshold was set at $40\%$ to obtain balanced classification error rates for the two classes.
```{r, echo=FALSE}
noquote('Confusion matrix for step.model with 40% decision threshold:')
print(Confu_T)
```
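As a worked check of the percentages quoted above, the sketch below recomputes the sensitivity and specificity from the four counts of the confusion matrix printed in the rendered report (rows are predictions, columns are observed classes). The object name `confusion` is introduced here for illustration only.

```r
# Counts from the reported confusion matrix (rows: predicted, columns: observed)
confusion <- matrix(c(522, 254, 160, 361), nrow = 2,
                    dimnames = list(predicted = c("FALSE", "TRUE"),
                                    observed  = c("FALSE", "TRUE")))

# Sensitivity: share of French entries (observed TRUE) predicted as French
sensitivity <- confusion["TRUE", "TRUE"] / sum(confusion[, "TRUE"])
# Specificity: share of German entries (observed FALSE) predicted as German
specificity <- confusion["FALSE", "FALSE"] / sum(confusion[, "FALSE"])

round(100 * c(sensitivity = sensitivity, specificity = specificity), 2)
# sensitivity 69.29, specificity 67.27
```

The column sums, 521 and 776, match the numbers of French and German entries in `data_FRAxGER`, and the two class error rates are simply one minus these two proportions.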
Note also that in view of the p-values of their Z-statistics, the high jump and the 110m hurdles appear less influential than the other variables. We therefore removed these two predictors from the model; however, the conclusions drawn from an analysis of the reduced model are similar to the conclusions based on `step.model`.
```{r, echo=FALSE}
noquote('p-values for the Z-values of step.model:')
print(coef(summary(step.model))[,4][1:5])
print(coef(summary(step.model))[,4][6:8])
```
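The stepwise selection used in this section can be illustrated on simulated data. In the sketch below, the data frame `toy`, its three predictors and the coefficients used to simulate the response are all hypothetical; only the calls to `glm` and `stepAIC` mirror the analysis above.

```r
library(MASS)  # provides stepAIC

set.seed(5)
n <- 400
toy <- data.frame(Psp = rnorm(n, mean = 700, sd = 50),
                  Ppv = rnorm(n, mean = 800, sd = 60),
                  Pjt = rnorm(n, mean = 650, sd = 60))  # Pjt plays no role below

# Simulate the class label from Psp and Ppv only
eta <- -0.03 * (toy$Psp - 700) + 0.02 * (toy$Ppv - 800)
toy$IsFrench <- factor(rbinom(n, size = 1, prob = plogis(eta)))

# Full logistic model, then AIC-based stepwise selection starting from it
full <- glm(IsFrench ~ Psp + Ppv + Pjt,
            family = binomial(link = "logit"), data = toy)
sel <- stepAIC(full, trace = FALSE)

# Estimates and p-values of the Z-statistics for the selected model
coef(summary(sel))[, c("Estimate", "Pr(>|z|)")]
```

The selected model never has a larger AIC than the starting model, but a predictor kept by AIC is not necessarily significant at conventional levels, which is why the p-values are inspected separately.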
# Conclusion
In this report, we have used various statistical tools to explore the `Decathlon` data set. We performed some *correlation analyses*, both parametric and non-parametric, and some *tests on means and variances* using the $t$-test, $F$-test and ANOVA, together with various tests for checking conditions. We also performed some *non-parametric tests* for the median of certain quantities, including the Wilcoxon and Kruskal-Wallis tests, and computed *confidence intervals* for certain means and variances. In addition, we computed and discussed some *linear regression* and *logistic regression* models, and used various *graphical representation tools* to illustrate the data and our findings. For the parametric tests, sample sizes were generally large enough to ensure the *validity of the normal approximation framework*.
Our main observations can be summarised as follows:
1. the best season performances appear to increase with the years, while the mean and median performances appear to decrease over the same period (Section \@ref(sec:perf-year));
2. the year 1988 is relatively "special", but not in a statistically significant way (Section \@ref(sec:perf-year));
3. the current scoring scheme of the decathlon is not homogeneous (Section \@ref(sec:diff-evts));
4. the best decathletes seem, in general, to be those who outperform the others in "pure-speed" events, including the long-jump (Section \@ref(sec:diff-evts));
5. there exist relatively strong partial correlations between groups of events (Section \@ref(sec:diff-evts));
6. certain countries seem, to a certain extent, to yield decathletes with specific profiles (Section \@ref(sec:french-german)).
These observations are in our opinion interesting and appear to be statistically significant. Our conclusions should nevertheless be checked and refined by further analyses, potentially using other sources of data.
# Appendix
```{r pairs-throw-evt, echo = FALSE, fig.height = 6, fig.width = 6, fig.cap = "Scatter plots, histograms and correlations for the three throw events.", fig.align = "center" }
# Diagonal panels: histogram of each variable, rescaled to unit height
panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
# Upper panels: Pearson correlation, printed with a size proportional to its magnitude
panel.cor <- function(x, y, ...)
{
  par(usr = c(0, 1, 0, 1))
  txt <- as.character(format(cor(x, y), digits=2))
  text(0.5, 0.5, txt, cex = 6* abs(cor(x, y)))
}
options(repr.plot.width=8, repr.plot.height=8)
pairs(Decathlon[c(17,22,23)], main = "Throw events",
      pch = 21,
      bg = 'coral2',
      upper.panel=panel.cor,
      diag.panel=panel.hist)
```
A study of the Decathlon dataset
A Student
20 February 2021
Abstract
We demonstrate how various descriptive and inferential statistical methods can be applied to the Decathlon dataset, and how their results might be interpreted and presented. We study the evolution of athlete performance as a function of time, and show that while the best performances appear to increase with time, the mean and median performances appear to decrease over the same period. We illustrate the non-homogeneity of the current decathlon scoring scheme, and give some insight into the profile of the best-performing decathletes. We also perform a correlation analysis to explore relationships between the different events of the decathlon, and finally present the results of a logistic-regression analysis and demonstrate that to a certain extent it is possible to distinguish between French and German decathletes by the scores they achieve on a subset of the decathlon events.
Contents
1 Introduction
2 Performance across years
2.1 Evolution of performances
2.2 Comparison of mean season performances
2.3 Comparison between 1988 and 1996
3 Differences between events
3.1 Difference between the median number of points scored during each event
3.2 Profile of the season best performers
3.3 Partial correlation between events
4 Differences between French and German decathletes
5 Conclusion
6 Appendix
1 Introduction
The Decathlon data set records the performances of elite decathletes in international competitions over the period from 1985 to 2006. The decathlon is a combined event in athletics consisting of ten track-and-field events; the current decathlon world-record holder is the French decathlete Kevin Mayer, who achieved a score of 9,126 points in 2018. One may refer to the following Wikipedia page for more information on the decathlon: https://en.wikipedia.org/wiki/Decathlon.
The data set consists of 7,968 observations of 24 variables. The names of the variables are listed hereafter:
## [1] Variables names:
##  [1] "Totalpoints"    "DecathleteName" "Nationality"    "m100"
##  [5] "Longjump"       "Shotput"        "Highjump"       "m400"
##  [9] "m110hurdles"    "Discus"         "Polevault"      "Javelin"
## [13] "m1500"          "yearEvent"      "P100m"          "Plj"
## [17] "Psp"            "Phj"            "P400m"          "P110h"
## [21] "Ppv"            "Pdt"            "Pjt"            "P1500"
An entry of the data set consists of the total number of points scored by a decathlete, the name and nationality
of the decathlete, and the year the performance was achieved. The raw performances for each of the 10
events are reported (with time in seconds and distance or height in meters), together with the number of
points scored for these events. There are 2,709 different decathletes of 107 different nationalities in the data
set.
Remark. For 435 entries of the data set, we observed a difference of 1 between the variable Totalpoints,
corresponding to the total number of points scored by a decathlete for that performance, and the sum of the
points scored during the 10 events. We decided to apply a correction to the corresponding entries of the
variable Totalpoints, so that they are all equal to the sum of the points scored during the 10 events.
# Correction of `Totalpoints`
Decathlon$Totalpoints <- rowSums(Decathlon[,15:24])
In our study we pay particular attention to the total number of points, the points scored during each event, the nationality of the decathletes, and the year the performances were achieved. The Pearson correlations between the total number of points and the points scored during the various events are illustrated by the correlogram shown in Figure 1, while scatter plots and histograms showing the points scored during the three throw events (shot put, discus and javelin throw) are presented in Figure 8 (see appendix).
2 Performance across years
In this section, we investigate the evolution of the overall performance (variable Totalpoints) as a function of the year these performances were achieved (variable yearEvent). The number of observations for each year (or season) varies between 321 and 399. A graphical representation of the evolution of the best, mean and median performance as a function of the year is shown in Figure 2.
[Figure: correlogram with rows and columns Totalpoints, P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt, P1500; colour scale from -1 to 1.]
Figure 1: Graphical representation of the Pearson correlations between the total number of points and the points scored during each event.
[Figure: two panels plotting Total points against Year, one for the best season performance and one for the mean and median season performances.]
Figure 2: Evolution of the best, mean and median performance as a function of the year. In each case, the regression line (dashed black line) is shown together with the corresponding 95% confidence intervals (purple dashed curves; prediction without noise term).
2.1 Evolution of performances
Figure 2 suggests that the best season performance increases with time, while the mean and median season performances appear to decrease with time. This observation is confirmed by tests of Spearman's rank correlation between the year and the best, mean and median season performances. The results are shown in Table 1, where we observe
• for the best season performance a positive Spearman correlation of 0.398, statistically significant at the significance level α = 0.05;
• for the mean season performance a negative Spearman correlation of -0.551, statistically significant at the significance level α = 0.01;
• for the median season performance a negative Spearman correlation of -0.581, statistically significant at the significance level α = 0.01.
Table 1: Results of Spearman's rank correlation tests between the year and the season's best, mean and median performances.
performance   alternative   p.value     estimate
Best          greater       0.0337857    0.3980802
Mean          less          0.0044491   -0.5505364
Median        less          0.0022965   -0.5807911
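A one-sided test of Spearman's rank correlation of the kind reported in Table 1 can be obtained with `cor.test`. The sketch below uses simulated season-best scores; the vector `best` and its trend are hypothetical, and only the form of the call is meant to reflect the analysis.

```r
set.seed(6)
year <- 1985:2006
# Hypothetical season-best scores with a mild upward trend
best <- 8700 + 5 * (year - 1985) + rnorm(length(year), sd = 80)

# H1: positive monotone association between year and best score
res <- cor.test(year, best, method = "spearman", alternative = "greater")

c(estimate = unname(res$estimate), p.value = res$p.value)
```

The estimate is the Pearson correlation of the ranks of the two variables, so it only measures monotone association and is insensitive to the scale of the scores.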
2.2 Comparison of mean season performances
Figure 3 gives an overview of the sample mean and sample variance of the performances achieved each year
as represented by the Totalpoints variable; the corresponding confidence intervals (at the confidence
level 70%) are also presented. We observe that 1988 has the largest sample mean and the second-largest
sample variance among all years, while 1996 has the second-largest sample mean and the largest sample
variance. Interestingly, 1988 and 1996 were both Olympic years, with the 1988 Games now infamous for
many proven doping cases. We also observe a relatively strong overlap between the confidence intervals
corresponding to these two years. The number of observations for each year is relatively large (at least 321 observations), so by the central limit theorem the normal approximations used to compute these intervals are likely to be reasonably good.
Three different tests for homogeneity of variances return p-values between 0.052 and 0.102, as shown in
Table 2. These results indicate that there is no strong statistical evidence to suggest that the variance of
athlete performances differs across years: for each of the three tests we do not reject the null hypothesis of
homoscedasticity at significance level α = 0.05.
Table 2: Tests for homogeneity of variance for the performances (i.e. scores achieved by the decathletes) achieved each year.
Test              p.value
Bartlett          0.0521832
Fligner-Killeen   0.0575033
Levene            0.1015971
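The three kinds of test listed in Table 2 can be sketched in base R as follows. The data frame `toy` is simulated with equal variances across four hypothetical seasons. The Levene-type test is written out by hand here as a one-way ANOVA on absolute deviations from the group medians, which is an assumption about the variant used; the `car` package offers `leveneTest` as a packaged alternative.

```r
set.seed(7)
toy <- data.frame(
  year   = factor(rep(1985:1988, each = 100)),
  points = rnorm(400, mean = 7300, sd = 410)   # same variance in every group
)

p_bartlett <- bartlett.test(points ~ year, data = toy)$p.value
p_fligner  <- fligner.test(points ~ year, data = toy)$p.value

# Levene-type test (median-centred): ANOVA on absolute deviations from group medians
toy$absdev <- abs(toy$points - ave(toy$points, toy$year, FUN = median))
p_levene <- anova(lm(absdev ~ year, data = toy))[["Pr(>F)"]][1]

c(Bartlett = p_bartlett, `Fligner-Killeen` = p_fligner, Levene = p_levene)
```

Bartlett's test is the most sensitive to departures from normality, so reporting it alongside the two more robust tests, as in Table 2, guards against a conclusion driven by non-normal tails.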
[Figure: sample variance plotted against sample mean of Totalpoints, one point per season labelled 85 to 06, with confidence intervals.]
Figure 3: Sample mean and sample variance for the performances achieved each season, and corresponding confidence intervals at the confidence level 70%. The horizontal intervals correspond to the means (in orange), and the vertical ones to the variances (in green).
Next we apply a one-way ANOVA to see whether there is a significant difference between the mean performance achieved each year, and find that it returns a p-value smaller than 0.0002 (see below), which suggests that there is a statistically significant difference between the mean performance for at least two years. Notice that the number of observations for each year is roughly the same, so the data are approximately balanced. Tests indicate that the data are not normally distributed; however, the number of observations in each year group is relatively large (at least 321), so by the central limit theorem the conclusion of this ANOVA is likely to be valid. We can also report that an ANOVA performed with the oneway.test function (which does not assume equality of variances) returns a p-value of similar magnitude.
AOV1 <- aov(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
print(summary(AOV1))
##                      Df    Sum Sq Mean Sq F value   Pr(>F)
## yearEvent_AsFactor   21 8.894e+06  423547   2.487 0.000184 ***
## Residuals          7946 1.353e+09  170302
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
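The comparison between the classical ANOVA and the version that does not assume equal variances can be sketched on simulated data. The data frame `toy`, its four hypothetical seasons and their means are invented for illustration.

```r
set.seed(8)
toy <- data.frame(
  year   = factor(rep(1985:1988, each = 100)),
  points = rnorm(400, mean = rep(c(7300, 7320, 7310, 7420), each = 100), sd = 410)
)

# Classical one-way ANOVA (assumes a common variance across groups)
aov_fit <- aov(points ~ year, data = toy)
p_aov <- summary(aov_fit)[[1]][["Pr(>F)"]][1]

# Welch-type one-way test (does not assume equal variances)
p_welch <- oneway.test(points ~ year, data = toy, var.equal = FALSE)$p.value

c(aov = p_aov, welch = p_welch)
```

When the group sizes are balanced and the variances similar, as in the decathlon data, the two p-values are expected to be close, which is the check reported in the text.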
2.3 Comparison between 1988 and 1996
In Figure 2 we observe that the sample mean of the overall performances for 1988 is the largest among all years. (A similar observation holds for the sample median, while 1988 interestingly also has the lowest season's best performance.) We now test to see whether there is a statistically significant difference between the mean performances for 1988 and 1996, the year having the second-largest sample mean of the overall performance.
## [1] F-test (comparison of variances); p-value=0.7034
## [1] t-test (comparison of means); p-value=0.4546
The F-test for equality of variances indicates that there is no statistical evidence to suggest a difference between the variances of the two samples. The two-sample t-test then indicates that there is no statistical evidence to suggest a difference between the means of the two samples. Note that although the two samples might not be normally distributed, the relatively large sample sizes ensure that the normal approximation will work well here.
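The two-step comparison described above (F-test for the variances, then a pooled-variance t-test for the means) can be sketched as follows. The vectors `y1988` and `y1996`, their sizes and their parameters are simulated and hypothetical; whether the original analysis used the pooled or the Welch version of the t-test is not stated, so `var.equal = TRUE` is an assumption motivated by the non-significant F-test.

```r
set.seed(9)
y1988 <- rnorm(330, mean = 7420, sd = 430)   # hypothetical 1988 scores
y1996 <- rnorm(350, mean = 7400, sd = 440)   # hypothetical 1996 scores

# Step 1: F-test for equality of the two variances
f_res <- var.test(y1988, y1996)

# Step 2: two-sample t-test with pooled variance
t_res <- t.test(y1988, y1996, var.equal = TRUE)

sprintf("F-test p-value=%.4f; t-test p-value=%.4f", f_res$p.value, t_res$p.value)
```

Dropping `var.equal = TRUE` gives the Welch t-test, which is the safer default when the F-test is borderline or the samples are small.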
3 Differences between events
Figure 4 illustrates the sample mean, sample median and sample variance of the number of points scored
in each of the 10 decathlon events, where we observe that the scheme for awarding points does not appear
to be homogeneous. For example, decathletes on average seem to score more points for the 110m hurdles
than for the javelin event, and the variance of the points scored for the pole vault appears to be considerably larger than the variance of the points scored for the 100m.
[Figure: mean, median and variance of the points for P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt and P1500; left axis from 650 to 800, right axis from 4000 to 10000.]
Figure 4: Sample mean, median and variance of the number of points scored during each event. Values for the mean and the median are read on the left-hand axis, and values for the variance on the right-hand axis.
3.1 Difference between the median number of points scored during each event
To test whether the apparent non-homogeneity of the decathlon scoring scheme suggested by Figure 4 is statistically significant, we perform a Kruskal-Wallis rank sum test of whether the median number of points scored differs between events (see below). The obtained p-value is extremely small, so there is strong statistical evidence that the median number of points differs for at least two events.
##
## Kruskal-Wallis rank sum test
##
## data: Pts_Event by Event
## Kruskal-Wallis chi-squared = 34243, df = 9, p-value < 2.2e-16
Next we perform a Wilcoxon signed-rank test (with continuity correction) to test whether the number of points scored for the pole-vault event tends to be larger than for the javelin-throw event. We use the paired version of the test, because both scores of each pair were achieved by the same decathlete during the same decathlon. The results of the test are summarised below.
## [1] Difference between the medians of Ppv and Pjt:
## [1] wilcox.test p-value=0.0000
The obtained p-value is extremely small, so there is strong statistical evidence to suggest that the median number of points scored during the pole-vault event is larger than for the javelin-throw event. To a lesser extent, we also observe a statistically significant difference between the median number of points scored for the 100m and long-jump events, even though these two look relatively close in Figure 4:
## [1] Difference between the medians of P100m and Plj:
## [1] wilcox.test p-value=0.0226
3.2 Profile of the season best performers
To identify which events appear to be the most decisive in determining the overall winner, for each year we compute the in-season ranking, in each event, of the decathlete with the best overall performance; the results are illustrated in Figure 5.
Although only descriptive, the rank-based analysis associated with Figure 5 seems to highlight some interesting features:
• The least decisive event appears to be the 1500m (in view of Figure 5). In our opinion, this can be at least partially explained as follows. The 1500m is the last of the 10 events, occurring at the end of the second day, so only the decathletes in a close fight for victory or trying to beat their personal record have an interest in performing well during this event (and they are certainly all tired after two days of competition). The 1500m is also the only pure endurance event (the only other event with an endurance component is the 400m), so there is no real interest for a decathlete to specialise in it. This observation is in agreement with the fact that the 1500m appears to be the event that is the least correlated with the others and with the total score achieved by a decathlete; see Figure 1.
• The most decisive event appears to be the 110m hurdles; in the data, the best season performer achieved the best or second-best season performance at the 110m hurdles in 50% of the cases. This might be explained by the fact that the 110m hurdles requires very good explosiveness and speed, combined with strong technique and excellent coordination; the risk of falling during the race is also very high in comparison to the other track events.
[Figure: boxplots of the in-season ranking (axis from 0 to 300) for P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt and P1500.]
Figure 5: Boxplot of the in-season rank achieved by the season best performers for each event of the decathlon.
• The second, third and fourth most decisive events appear to be the 100m, the long jump and the 400m, respectively. Although the long jump is a jump event, the qualities needed to perform well at it are very similar to those required to be a good sprinter (the correlation between the scores achieved during the long jump and the 100m is actually relatively strong, at 0.49).
These observations suggest that top-performing decathletes excel in the "speed-related" events, and perform relatively well in the other events except for the 1500m.
3.3 Partial correlation between events
To further investigate the relationships between the different events of the decathlon, we perform a partial correlation analysis by computing the partial correlation between the points scored for every pair of events while controlling for all the other events. The computed partial correlations are illustrated in Figure 6.
Interestingly, and in comparison to the correlations shown in Figure 1, if we control for all the other events, we observe only a few relatively strong partial correlations between pairs of events (all of which are significant at α = 0.01). In particular we observe that:
• P100m and P1500 are negatively correlated,
• P400m and P1500 are positively correlated,
• P400m and P100m are positively correlated,
• Psp and Pdt are positively correlated.
The positive partial correlations for the pairs P400m-P100m and P400m-P1500, and the negative partial correlation for the pair P1500-P100m, might be a consequence of the fact that decathletes performing well at the 400m may either be very fast or have good endurance. The relatively strong partial correlation between Psp and Pdt is also not surprising, because the shot put and discus throw require very similar physical abilities and technical skills.
[Figure: correlogram with rows and columns P100m, Plj, Psp, Phj, P400m, P110h, Ppv, Pdt, Pjt, P1500; colour scale from -1 to 1.]
Figure 6: Graphical representation of the partial correlations between the points scored during every pair of events while controlling for all the other events.
4 Differences between French and German decathletes
In this section, we explore the possibility of differentiating between French and German decathletes by using a logistic regression model based on the points scored during a selection of events. To this end, we extract the entries of the data set corresponding to the points scored during the 10 events by the French and German decathletes, and gather these entries in a data set named data_FRAxGER; a categorical variable IsFrench is added to this data set, taking the value 1 when the entry corresponds to a French decathlete, and 0 for a German decathlete. The resulting data set has 1,297 entries, with 40.17% corresponding to French decathletes. Notice that no decathlete appears more than 12 times in the extracted data set.
## [1] Number of French entries: 521
## [1] Number of German entries: 776
## [1] Number of different French decathletes: 151
## [1] Number of different German decathletes: 247
To select a set of relevant events, we use the stepAIC function of the MASS package. The stepwise procedure is initialised from the logistic model based on the points scored during all 10 events. The resulting model, named step.model, depends on 7 events; its coefficients are given below.
## [1] Coefficients of step.model:
## (Intercept) Pdt P400m Ppv Phj
## 2.425113260 0.005761762 -0.005195875 0.004739631 0.001693611
## P110h Psp P100m
## -0.001676468 -0.015800780 0.005794585
The ROC curve corresponding to the model step.model is given in Figure 7. Although not especially impressive, the model appears to be able to identify some statistically significant differences between the French and German decathletes. Based on the coefficients of step.model, French decathletes appear, to a certain extent, to be better than their German counterparts in the discus throw, pole vault, high jump and 100m (positive coefficients), while German decathletes on average seem to perform better in the 400m, 110m hurdles and shot put (negative coefficients).
With a decision threshold set at 40%, the binary classifier based on the logistic model step.model correctly classifies 69.29% of the French decathletes and 67.27% of the German decathletes, as reported below (see also Figure 7). Note that the threshold was set at 40% to obtain balanced classification error rates for the two classes.
[Figure: ROC curve plotting Sensitivity against Specificity.]
Figure 7: ROC curve for the logistic regression model step.model; the orange dashed lines correspond to the sensitivity and specificity of the resulting binary classifier for a decision threshold at 40%.
## [1] Confusion matrix for step.model with 40% decision threshold:
## FALSE TRUE class.error
## FALSE 522 160 0.3273196
## TRUE 254 361 0.3071017
Note also that in view of the p-values of their Z-statistics, the high jump and the 110m hurdles appear less influential than the other variables. We therefore removed these two predictors from the model; however, the conclusions drawn from an analysis of the reduced model are similar to the conclusions based on step.model.
## [1] p-values for the Z-values of step.model:
## (Intercept) Pdt P400m Ppv Phj
## 3.483006e-02 4.743905e-07 3.990390e-05 1.665840e-11 7.215314e-02
## P110h Psp P100m
## 1.481284e-01 2.258318e-29 3.469760e-05
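The p-values above are those of the Wald Z-statistics reported by summary() for a fitted glm. A minimal sketch of extracting them and refitting without a weak predictor is given below; the data and the names x1, x2, x3, fit and reduced are hypothetical, and the likelihood-ratio comparison is an added illustration rather than a step taken in this report.

```r
set.seed(4)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Simulated labels: x3 has no effect on the response
sim$y <- rbinom(n, 1, plogis(0.9 * sim$x1 - 0.7 * sim$x2))

fit <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial)

# Coefficient table: estimate, standard error, z value, two-sided p-value
z.table  <- coef(summary(fit))
p.values <- z.table[, "Pr(>|z|)"]
print(p.values)

# Refit without the weak predictor and compare the nested models
reduced <- update(fit, . ~ . - x3)
print(anova(reduced, fit, test = "Chisq"))
```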
5 Conclusion
In this report, we have used various statistical tools to explore the Decathlon data set. We performed some
correlation analyses, both parametric and non-parametric, and some tests on means and variances using the
t-test, F-test and ANOVA, together with various tests for checking conditions. We also performed some
non-parametric tests for the median of certain quantities, including the Wilcoxon and Kruskal-Wallis
rank-sum tests, and computed confidence intervals for certain means and variances. In addition, we computed and
discussed some linear regression and logistic regression models, and used various graphical representation
tools to illustrate the data and our findings. For the parametric tests, sample sizes were generally large
enough to ensure the validity of the normal approximation framework.
Our main observations can be summarised as follows:
1. the best season performances appear to increase with the years, while the mean and median
performances appear to decrease over the same period (Section 2);
2. the year 1988 is relatively "special", but not in a statistically significant way (Section 2);
3. the current scoring scheme of the decathlon is not homogeneous (Section 3);
4. the best decathletes seem, in general, to be those who outperform the others in "pure-speed" events,
including the long jump (Section 3);
5. there exist relatively strong partial correlations between groups of events (Section 3);
6. certain countries seem, to a certain extent, to yield decathletes with specific profiles (Section 4).
These observations are in our opinion interesting and appear to be statistically significant. Our conclusions
should nevertheless be checked and refined by further analyses, potentially using other sources of data.
6 Appendix
[Figure 8 plot: scatter-plot matrix titled "Throw events" for the variables Psp, Pdt and Pjt, with histograms and pairwise correlations of 0.72, 0.44 and 0.42.]
Figure 8: Scatter plots, histograms and correlations for the three throw events.