Analyse von Trainingsdaten

Seike Appold, Malte Bartels, Nils Meier, Kai Moltzen & Maximilian Reinhardt

Abstract
The following analysis uses statistical methods to examine training data from 2022 and attempts to identify common patterns. More detailed, the relationship between the type of sports and the aerobic training effect is analyzed as well as the predicting the ATE based on recorded features. Additionally, the data is transformed using a PCA and clustered into segments.
The project was carried out as part of the course “Applied Statistical Data Analysis” in the Master’s program Management & Data Science in the winter semester 2022/23 taught by Henrik von Wehrden.

Training_Analysis_Final

In [1]:

# Import required packages
from time_functions import *
import calendar
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as sm
import statsmodels.api
from statsmodels.stats import anova
from statsmodels.formula import api
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore', category=RuntimeWarning)
sns.set_style('white')

Von Süda bis doa
A Dach’has schwebd über ons
Malde isch so schee

Introduction¶

The analysis presented in the following paragraphs is based on training data of the year 2022 from an athlete studying in the Management & Data Science Master program at Leuphana university. In every training session, training data was gathered using a smartwatch. The training types include running, yoga, weight training and crosstraining. As the athlete is a competitive runner, the aerobic training effect (ATE) is of particular interest as it indicated progress in the key athletic skills required for peak performance in races.

The following research questions are addressed:

Question 1: When were certain activities performed?
Question 2: How do training types differ in performance parameters?
Question 3: How can running activities be clustered?
Question 4: How is the aerobic training effect explained by other training parameters?

After the initial loading and preprocessing of the data, question 1 is adressed using histograms. Following, to answer question 2, a row of one-way ANOVAs are conducted. Question 3 is approached with a combined approach using principal component (PCA) and cluster analysis. Lastly, to answer question 4, a generalized linear model (GLM) is developed and different model reduction approaches are applied.

The most relevant features include:

Activity type: The type of the activity
Distance: The distance of the activity in km (if recorded for the activity)
Time: The time of the activity
Calories: The calories burned during the activity
Avg. heart rate: The average heart rate during the activity
Max. heart rate: The maximum heart rate during the activity
Avg. pace: The average pace during the activity (if recorded for the activity)
Aerobic TE: The aerobic training effect of the activity, which measures the training load of the activity

Preprocessing¶

Aerober TE
Den Nils um Hilfe bitten
Leerzeichen sind Schmutz

The first step is to load and preprocess the data, in order to make it ready for further analysis. To do so, an iterative approach is used, where the data is repeatedly inspected and cleaned / adjusted, until no further changes are necessary. The preprocessing includes dropping irrelevant columns, transforming the date columns, replacing the -- values with NaN, renaming the columns, and transforming the time columns to numeric values in minutes. Furthermore, zero values in the heart rate columns are replaced with NaN, as this is not possible. Besides that, the data is limited to the most relevant activities. This is done, as all types of analysis based on a real small number of samples are not meaningful. Finally, a subsample containing only running activities is created for further analysis on the running activities.

In [2]:

def load_and_preprocess_dataset(path: str) -> pd.DataFrame:
    # Load the original dataset
    df = pd.read_csv(path, encoding='utf8')

    # Drop all irrelevant columns
    potential_drops = ['Durchschnittliches vertikales Verhältnis', 'Ø vertikale Bewegung', 'Ø Bodenkontaktzeit',
                       'Training Stress Score®', 'Ø Leistung', 'Max. Leistung', 'Grit', 'Flow', 'Ø Swolf',
                       'Ø Schlagrate', 'Tauchzeit', 'Oberflächenpause', 'Dekompression', 'Favorit', 'Titel',
                       'Zeit in Bewegung', 'Verstrichene Zeit', 'Beste Rundenzeit', 'Wiederholungen insgesamt',
                       'Sätze insgesamt']
    df.drop(columns=[c for c in potential_drops if c in df.columns],inplace=True)

    # Transform data date columns and add a new column for the time
    df.Datum = pd.to_datetime(df.Datum)
    df.insert(loc=2, column='Start Time', value=df['Datum'].dt.time)

    # Replace all -- values with NaN
    df.replace('--', np.nan, inplace=True)

    # Rename different types of running activities to 'Laufen'
    df.loc[df['Aktivitätstyp'].isin(['Indoor-Laufen', 'Laufbandtraining']), 'Aktivitätstyp'] = 'Laufen'

    # Transform the time column to numeric values in minutes
    for col in ['Zeit', 'Ø Pace', 'Beste Pace']:
        df[col] = df[col].astype(str).apply(lambda t: from_text_to_numeric(time=t, output='m'))


    # Try to transform all columns to numeric datatype
    for col in df.columns:
        try:
            df[col] = df[col].astype(str).str.replace(',', '').astype(float)
        except:
            pass

    # Rename the columns to replace the Ø sign with Avg
    df.columns = [c.replace('Ø ', 'Avg_').replace(' ', '_') for c in df.columns]

    # Replace 0.0 values in the Avg_Herzfrequenz column with NaN as this is not possible
    df.loc[df['Avg_Herzfrequenz'] == 0.0, 'Avg_Herzfrequenz'] = np.nan

    # Return the DataFrame
    return df

# Show first and last 5 rows
data = load_and_preprocess_dataset('Activities_2022_Cleaned.csv')
pd.concat([data.head(), data.tail()])

Out[2]:

	Aktivitätstyp	Datum	Start_Time	Distanz	Kalorien	Zeit	Avg_Herzfrequenz	Maximale_Herzfrequenz	Aerober_TE	Avg_Schrittfrequenz_(Laufen)	...	Maximale_Temperatur	Avg_Atemfrequenz	Minimale_Atemfrequenz	Maximale_Atemfrequenz	Stressänderung	Stress:_Start	Stress:_Ende	Avg_Stress	Minimale_Höhe	Maximale_Höhe
0	Laufen	2022-12-31 12:58:11	12:58:11	5.00	346.0	25.133333	149.0	171.0	2.2	169.0	...	23.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-43.0	-32.0
1	Laufen	2022-12-31 12:15:08	12:15:08	4.97	309.0	16.500000	184.0	192.0	3.2	180.0	...	23.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-8.0	1.0
2	Laufen	2022-12-31 11:28:19	11:28:19	5.00	351.0	24.150000	154.0	177.0	2.4	167.0	...	28.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-42.0	-33.0
3	Yoga	2022-12-30 17:40:14	17:40:14	NaN	84.0	47.566667	69.0	88.0	0.0	NaN	...	30.0	21.0	12.0	24.0	-40.0	60.0	20.0	30.0	NaN	NaN
4	Laufen	2022-12-30 13:21:52	13:21:52	12.05	786.0	56.450000	148.0	157.0	2.8	168.0	...	28.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	26.0	48.0
674	Laufen	2022-01-04 13:29:51	13:29:51	12.50	815.0	66.133333	146.0	182.0	3.0	151.0	...	28.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	36.0	53.0
675	Yoga	2022-01-03 20:14:01	20:14:01	NaN	58.0	37.433333	66.0	83.0	0.0	NaN	...	30.0	14.0	7.0	23.0	13.0	12.0	25.0	24.0	NaN	NaN
676	Laufen	2022-01-03 14:23:36	14:23:36	8.00	505.0	38.616667	142.0	150.0	2.4	166.0	...	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	19.0	28.0
677	Laufen	2022-01-02 14:20:31	14:20:31	16.80	1095.0	71.633333	158.0	171.0	3.6	165.0	...	29.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5.0	51.0
678	Laufen	2022-01-01 14:23:42	14:23:42	12.03	756.0	49.000000	161.0	184.0	3.6	170.0	...	30.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5.0	34.0

10 rows × 28 columns

In [3]:

# rename column Aktivitätstyp to Activity
data.rename(columns={'Aktivitätstyp': 'Activity'}, inplace=True)

In [4]:

# change all values including Laufen in the 'Aktivitätstyp' column to running 
data.loc[data['Activity'].isin(['Laufen']), 'Activity'] = 'Running'
data.loc[data['Activity'].isin(['Krafttraining']), 'Activity'] = 'Strength Training'
data.loc[data['Activity'].isin(['Crosstraining']), 'Activity'] = 'Cross Training'

In [5]:

data['Activity'].value_counts()

Out[5]:

Running              355
Yoga                 181
Strength Training     91
Cross Training        43
Radfahren              3
Atemübung              2
Cardio                 2
Indoor Cycling         1
Gehen                  1
Name: Activity, dtype: int64

In [6]:

data = data[data['Activity'].isin(['Running', 'Yoga', 'Strength Training', 'Cross Training'])]

In [7]:

# Take a subsample including only running activities and select numeric datatypes only
running_data = data[data.Activity == 'Running'].select_dtypes(include=np.number).drop(columns=['Avg_Atemfrequenz', 'Maximale_Atemfrequenz', 'Minimale_Atemfrequenz',
                                                                                                   'Stressänderung', 'Stress:_Start', 'Stress:_Ende', 'Avg_Stress']).dropna()
running_data_std = pd.DataFrame(StandardScaler().fit_transform(running_data), columns=running_data.columns, index=running_data.index)
# rename columns from running_data_std to English
running_data_std.rename(columns={'Distanz': 'Distance', 'Kalorien': 'Calories', 'Zeit': 'Time', 'Avg_Herzfrequenz' : 'Avg_heart_rate', 'Aerober_TE': 'Aerobic_training_effect', 'Maximale_Herzfrequenz': 'Max_Heart_Rate', 'Avg_Schrittfrequenz_(Laufen)': 'Avg_Step_Frequency', 'Max._Schrittfrequenz_(Laufen)': 'Max_Step_Frequency', 'Beste_Pace': 'Max_Pace', 
                                 'Anstieg_gesamt': 'Total_Ascent', 'Abstieg_gesamt': 'Total_Descent', 'Avg_Schrittlänge': 'Avg_Step_Length', 'Minimale_Temperatur': 'Min_Temperature', 'Anzahl_der_Runden': 'No_of_Rounds', 'Maximale_Temperatur': 'Max_Temperature', 'Minimale_Höhe': 'Min_Height', 'Maximale_Höhe': 'Max_Height'}, inplace=True)
# rename columns to English
data.rename(columns={'Kalorien': 'Calories'}, inplace=True)
data.rename(columns={'Zeit': 'Time'}, inplace=True)
data.rename(columns={'Avg_Herzfrequenz': 'Avg_heart_rate'}, inplace=True)
data.rename(columns={'Aerober_TE': 'Aerobic_training_effect'}, inplace=True)

Question 1: When were certain activities performed?¶

In this section, the data is explored in order to get a better understanding of the data and to answer the first research questions. To do so, the approach is shortly described and the results as well as reasoning are presented.

At which time of the day did the athlete train?¶

To answer this question, the start time of the activities is plotted as a histogram. The results show that running activities have three peaks, while the other activities have only one peak. The three peaks of the running activities are in the morning, shortly after lunch and the late afternoon. Yoga is mainly performed in the evening, while cross-training and strength training have a higher overlap within the late afternoon and evening. Overall, one can say that the activities follow a fix pattern and that athletes follow a daily routine.

In [8]:

activities = data['Activity'].unique()

fig, axs = plt.subplots(1, 4, tight_layout=True, figsize=(16, 3))
axs = axs.flatten()
for i, activity in enumerate(activities):
    values = data.query(f'Activity == "{activity}"')['Start_Time'].astype(str).apply(lambda t: from_text_to_numeric(time=t, output='h'))
    sns.histplot(data=values, bins=40, ax=axs[i], color='lightblue')
    title = f'{activity} \n Earliest is {from_numeric_to_text(time=min(values), input_format="h", output="hh:mm")} - ' \
            f'Latest is {from_numeric_to_text(time=max(values), input_format="h", output="hh:mm")}'
    axs[i].set(xlabel='Time of the Day', ylabel='Number of Activities', title=title)

fig.show()

How many kilometers were run per month?¶

To answer this question, the data is grouped by month and the sum of the distance is calculated. The results show that the number of kilometers run per month fluctuates over the year. An increase in spring is visible, followed by a bit less kilometers in summer. In September the distance drops significantly, until a fix level is reached in the winter months. The results show that the training is adapted to the racing season in summer. An intensive build-up phase in the spring is followed by races in the summer with a subsequent season break before the build-up training starts again.

In [9]:

fig, ax = plt.subplots(tight_layout=True, figsize=(16, 4))
values = data[data['Activity'].isin(['Running', 'Indoor-Laufen', 'Laufbandtraining'])].groupby(data.Datum.dt.month)['Distanz'].sum()
sns.barplot(x=values.index, y=values.values, ax=ax, color='lightblue')
ax.set(title='Kilometers per Month', xlabel='Month', ylabel='Kilometers')
ax.set_xticklabels(list(calendar.month_abbr)[1:])
fig.show()

How many kilometers were run per week?¶

The data is grouped by week and the sum of the distance is calculated and plotted. All across the year the results show the pattern that the kilometers run increase over approximately three weeks before they drop for one week. This observation can be explained by the fact that distance athletes follow a training plan based on the idea of increased training load over multiple weeks followed by a so-called deloading week to recover.

In [10]:

values = data[data['Activity'].isin(['Running', 'Indoor-Laufen', 'Laufbandtraining'])].groupby(data.Datum.dt.week)['Distanz'].sum()
fig, ax = plt.subplots(tight_layout=True, figsize=(16, 4))
sns.barplot(x=values.index, y=values.values, ax=ax, color='lightblue')
ax.set(title='Kilometers per Week', xlabel='Week', ylabel='Kilometers')
fig.show()

Question 2: How do training types differ in performance parameters?¶

In the following section a One-Way-Anova is performed to check if there is a significant difference in the physiological measurements between the different activity types. First, boxplots for the features calories, time, average heart rate and ATE are created. Next, the One-Way-Anova is performed and the results are presented. Except for the time, the results show a decrease in the residuals which indicates that the activities have a significant effect on the physiological measurements. Based on that one can conclude that the activities have a different effect on the training load of an athlete and should be considered in the training plan.

In [11]:

# Select the features to be used in the Anova Analysis
selected_features = ['Calories', 'Time', 'Avg_heart_rate', 'Aerobic_training_effect']

# Create a boxplot for each feature
fig, axs = plt.subplots(ncols=len(selected_features), figsize=(14, 4), tight_layout=True)
for i, feature in enumerate(selected_features):
    sns.boxplot(data=data, x='Activity', y=feature, ax=axs[i])
    axs[i].set(xlabel='', ylabel='', title=feature)
    axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation=45)
    if feature == 'Avg_heart_rate':
        axs[i].set(title='Avg. Heart Rate')
    if feature == 'Aerobic_training_effect':
        axs[i].set(title='Aerobic Training Effect')
fig.show()

In [12]:

# Perform an Anova Analysis on Kalorien based on type of activity
model = api.ols(formula='Calories ~ C(Activity)', data=data).fit()
anova = sm.stats.anova.anova_lm(model, typ=2)
anova

Out[12]:

	sum_sq	df	F	PR(>F)
C(Activity)	5.811938e+07	3.0	355.936846	6.897471e-138
Residual	3.624942e+07	666.0	NaN	NaN

In [13]:

# Perform an Anova Analysis on Zeit based on type of activity
model = api.ols(formula='Time ~ C(Activity)', data=data).fit()
anova = sm.stats.anova.anova_lm(model, typ=2)
anova

Out[13]:

	sum_sq	df	F	PR(>F)
C(Activity)	43173.857032	3.0	44.442503	3.451826e-26
Residual	215662.838840	666.0	NaN	NaN

In [14]:

# Perform an Anova Analysis on Avg_Herzfrequenz based on type of activity
model = api.ols(formula='Avg_heart_rate ~ C(Activity)', data=data).fit()
anova = sm.stats.anova.anova_lm(model, typ=2)
anova

Out[14]:

	sum_sq	df	F	PR(>F)
C(Activity)	797934.523613	3.0	4517.781263	0.0
Residual	39033.215518	663.0	NaN	NaN

In [15]:

# Perform an Anova Analysis on Aerober_TE based on type of activity
model = api.ols(formula='Aerobic_training_effect ~ C(Activity)', data=data).fit()
anova = sm.stats.anova.anova_lm(model, typ=2)
anova

Out[15]:

	sum_sq	df	F	PR(>F)
C(Activity)	1196.461300	3.0	1741.074254	8.283084e-314
Residual	151.870574	663.0	NaN	NaN

Question 3: How can running activities be clustered?¶

In order to address this question, a cluster analysis is conducted that reveals how the groups differ in performance parameters. Since many of the input features which are relevant for the clustering of running sessions show correlations in additional exploratory data analyses, e.g. distance, time and calories, a PCA is performed prior to the cluster analysis. Thus, the principal components (PC) which are free of redundancies can be used as input features for the cluster analysis.

In [16]:

# Apply a PCA on the running data
pca = PCA(n_components=len(running_data.columns))
running_pcs = pd.DataFrame(pca.fit_transform(running_data_std), columns=[f'PC_{i + 1}' for i in range(pca.n_components_)],
                           index=running_data.index)

# Plot the explained variance per PC
fig, ax = plt.subplots(figsize=(16, 4), tight_layout=True)
sns.barplot(x=[f'PC{i+1}' for i in range(pca.n_components_)], y=pca.explained_variance_ratio_, color='lightblue', ax=ax)
ax.set(title='Explained Variance per PC', xlabel='PC', ylabel='Explained Variance')
fig.show()

The PCA shows substantial drops in explained variance after the first, second, and fifth PC indicating that the inclusion of the most relevant one, two or five PCs in a model might be reasonable choices. PC1 explains approximately 35%, PC1 and PC2 approximately 55% and PC1-PC5 approximately 85% of the variance in the data. We decided to use PC1 and PC2 as input features for the cluster analysis.

In order to perform cluster analysis, k-means method is applied. Since the solution of k-means clustering is not always optimal and depends on the initialization of cluster means, assignments of datapoints are performed ten times based on randomly chosen initializations from which the best one is chosen.

At first, the optimal number of clusters is determined. This is done by computing clustering solutions for 1 to 15 clusters and comparing the results based on inertia and silhouette scores. Inertia is a commonly used metric measuring the heterogeneity inside of clusters. The silhouette score is a measure taking the heterogeneity inside and between clusters into account (see further readings). After determining an adequate number of clusters, the final cluster assignment is calculated and described.

In [17]:

# Check the scores for different number of clusters to find the optimal number
silhouette_scores = []
inertias = []
for i in range(2, 16):
    cluster_model = KMeans(n_clusters=i, random_state=42, n_init=10)
    cluster = cluster_model.fit_predict(running_pcs[['PC_1', 'PC_2']])
    silhouette_scores.append(silhouette_score(running_pcs[['PC_1', 'PC_2']], cluster))
    inertias.append(cluster_model.inertia_)

scores = pd.DataFrame({'Silhouette Score': silhouette_scores, 'Inertia': inertias},
                      index=range(2, 16))

fig, axs = plt.subplots(1, 2, figsize=(16, 4), tight_layout=True)
fig.suptitle('Silhouette Score and Inertia for different number of clusters', fontsize=16)
sns.barplot(data=scores, x=scores.index, y='Silhouette Score', ax=axs[0], color='lightblue')
sns.lineplot(data=scores, x=scores.index, y='Inertia', ax=axs[1])
axs[0].set(title='', xlabel='Number of Clusters', ylabel='Silhouette Score')
axs[1].set(title='', xlabel='Number of Clusters', ylabel='Inertia')
fig.show()

Inertia and silhouette scores indicate that a 5-cluster-solution would ensure low heterogeneity inside and high heterogeneity between clusters so that the final assignment is performed with five clusters.

In [18]:

# Apply KMeans to the PCs with 5 clusters
cluster_model = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster = cluster_model.fit_predict(running_pcs[['PC_1', 'PC_2']])

fig, ax = plt.subplots(figsize=(8, 6), tight_layout=True)
sns.scatterplot(data=running_pcs, x='PC_1', y='PC_2', hue=cluster.astype(str), ax=ax,
                hue_order=['0', '1', '2', '3', '4'])
ax.set(title='Clustering Results with 5 Clusters', xlabel='PC1', ylabel='PC2')
ax.legend().set_title('Cluster')
fig.show()

In [19]:

# Combine the cluster labels with the original data and plot the average values per cluster
running_data['Cluster'] = cluster
values = running_data.groupby('Cluster').mean()[['Distanz', 'Zeit', 'Avg_Herzfrequenz','Avg_Pace', 'Anzahl_der_Runden']].reset_index()
values = pd.melt(values, id_vars='Cluster', value_vars=['Distanz', 'Zeit', 'Avg_Herzfrequenz','Avg_Pace', 'Anzahl_der_Runden'], var_name='Metric', value_name='Value')

fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(16, 4), tight_layout=True)
fig.suptitle('Average Values per Cluster', fontsize=16)
for i, metric in enumerate(['Distanz', 'Zeit', 'Avg_Herzfrequenz','Avg_Pace', 'Anzahl_der_Runden']):
    sns.barplot(data=values[values.Metric == metric], x='Cluster', y='Value', ax=ax[i], color='lightblue')
    ax[i].set(xlabel='Cluster', ylabel=metric, ylim=(values[values.Metric == metric].Value.min() * 0.8, values[values.Metric == metric].Value.max() * 1.1))
    if metric == 'Distanz':
        ax[i].set(ylabel='Distance')
    if metric == 'Avg_Herzfrequenz':
        ax[i].set(ylabel='Avg. Heart Rate')
    if metric == 'Zeit':
        ax[i].set(ylabel='Time')
    if metric == 'Avg_Pace':
        ax[i].set(ylabel='Avg. Pace')
    if metric == 'Anzahl_der_Runden':
        ax[i].set(ylabel='No. of Rounds')
fig.show()

# Drop the cluster column again
running_data.drop(columns=['Cluster'], inplace=True)

The final clusters are described by the variables distance, time, average heart frequency, average pace and number of rounds. All selected descriptive variables are easy to observe during a training session so that the athlete could relate them to a certain training behaviour. In this way, the clusters could be assigned to the following interpretable short descriptions:

Cluster 0: Standard runs with medium pace and medium heart rate
Cluster 1: Long runs with medium pace and medium heart rate
Cluster 2: Intervall Sessions with high number of rounds
Cluster 3: Short recovery runs with low pace and low heart rate
Cluster 4: High intensity runs with high pace and high heart rate

Question 4: How is the aerobic training effect explained by other training parameters?¶

As described above, the ATE is one of the key parameters of interest to running athletes. GLMs are employed to predict this effect as accurately as possible. To gather first insights into which parameters best approximate ATE, scatterplots using one feature against the ATE are generated. Below, two plots employing average heart rates are presented depicting the different types of activity and duration of the training sessions respectively.

In [20]:

fig, axs = plt.subplots(ncols=2, figsize=(16, 4), tight_layout=True)
sns.scatterplot(data=data, x='Avg_heart_rate', y='Aerobic_training_effect', hue='Activity', ax=axs[0])
axs[0].set(xlabel='Average Heartrate', ylabel='Aerobic Training Effect')
sns.scatterplot(data=data, x='Avg_heart_rate', y='Aerobic_training_effect', hue='Time', ax=axs[1])
axs[1].set(xlabel='Average Heartrate', ylabel='Aerobic Training Effect')
axs[1].legend().set_title('Time in min')
fig.show()

Yoga has no ATE and does not increase the athlete’s heart rate. In contrast, running, cross and strength training generate an ATE and raise the heart rate. While for cross and strength training, the plot indicates a relatively linear relation between heart rate and ATE, the data displays a higher variance for running activities. The right plot shows that the ATE increases proportionately with the duration of training. Due to the high correlations, average heart rate and time (duration) are included in all hypothesis models for GLMs below.

The dataset includes various positive continuous, discrete, and interval data with different distributions assumed, which is why GLMs are applied. This method uses a link function that relates a linear predictor and the assumed probability distribution of the observed variable to predict dependent variables with non-linear relations to the exogeneous, discrete targets, or diverse variance. Poisson probability distribution is used because ATE is a positive, discrete measure. (cf. further readings)

First, a full model is generated on the standardized training data. Hereafter, degrees of freedom (DoF) are reduced iteratively to minimize the risk of overfitting while maintaining as much predictive power as possible, following Occam’s razor. Hence, the models are compared with Akaike Information Criterion (AIC), which incorporates explanatory power and DoF.

To reduce multicollinearity, parameters with correlation coefficients beyond 0.9 are excluded based on a correlation heatmap. Subsequently, the parameters with the highest variance inflation factor (VIF) are dropped iteratively, until all coefficients are below the conventional threshold of 10. (cf. Model reduction)

Full Model including all features¶

In [21]:

def calc_consistent_aic(model_results):
    consistent_aic = -2 * model_results.llf + model_results.df_model * np.log(running_data_std.shape[0]+1)
    return consistent_aic

In [22]:

# GLM using statsmodels library including all features
model_full = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                        exog=running_data_std.drop('Aerobic_training_effect', axis=1),
                        family=sm.api.families.Poisson())
results_full_model = model_full.fit()
print(results_full_model.summary())
print(f'AIC: {results_full_model.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_full_model)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      324
Model Family:                      Poisson   Df Model:                           16
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -320.82
Date:                     Fri, 10 Mar 2023   Deviance:                       10285.
Time:                             22:07:43   Pearson chi2:                 4.41e+04
No. Iterations:                          8   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Distance              -5.9296      0.962     -6.162      0.000      -7.816      -4.044
Calories               1.7058      0.662      2.575      0.010       0.408       3.004
Time                   4.6185      0.661      6.984      0.000       3.322       5.915
Avg_heart_rate        -0.1426      0.128     -1.117      0.264      -0.393       0.108
Max_Heart_Rate         0.0345      0.059      0.585      0.558      -0.081       0.150
Avg_Step_Frequency    -0.7584      0.141     -5.366      0.000      -1.035      -0.481
Max_Step_Frequency    -0.0425      0.078     -0.547      0.584      -0.195       0.110
Avg_Pace              -3.6809      0.465     -7.910      0.000      -4.593      -2.769
Max_Pace              -0.1967      0.083     -2.364      0.018      -0.360      -0.034
Total_Ascent           0.0591      0.129      0.458      0.647      -0.194       0.312
Total_Descent          0.0565      0.124      0.456      0.649      -0.186       0.299
Avg_Step_Length       -1.5602      0.302     -5.169      0.000      -2.152      -0.969
Min_Temperature       -0.0893      0.072     -1.242      0.214      -0.230       0.052
No_of_Rounds          -0.0743      0.079     -0.940      0.347      -0.229       0.081
Max_Temperature        0.1141      0.067      1.699      0.089      -0.018       0.246
Min_Height             0.1100      0.083      1.326      0.185      -0.053       0.273
Max_Height            -0.0496      0.083     -0.601      0.548      -0.211       0.112
======================================================================================
AIC: 675.6454709358195
Consistent AIC: 735.0024427288212

Reduced Model using stepwise reduction¶

In [23]:

# Plot a heatmap of the correlations
corr = running_data.corr()
mask = np.triu(corr)
fig, ax = plt.subplots(figsize=(12, 10), tight_layout=True)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', mask=mask, ax=ax)
# translate features to English
ax.set_xticklabels(replace_xtick_labels(ax.get_xticklabels()))
ax.set_yticklabels(replace_xtick_labels(ax.get_yticklabels()))

fig.show()

In [24]:

def calculate_vif(input_df):
    vif = pd.DataFrame()
    vif['VIF Factor'] = [variance_inflation_factor(input_df.values, col) for col in range(len(input_df.columns))]
    vif['features'] = input_df.columns
    return vif

In [25]:

running_data_after_vifs = running_data_std.drop(['Aerobic_training_effect', 'Calories', 'Time', 'Avg_Step_Length', 'Total_Descent'], axis=1)
calculate_vif(running_data_after_vifs).sort_values('VIF Factor', ascending=False)

Out[25]:

	VIF Factor	features
5	6.496549	Avg_Pace
3	5.190501	Avg_Step_Frequency
12	4.904805	Max_Height
1	4.818190	Avg_heart_rate
9	4.585780	No_of_Rounds
11	4.112934	Min_Height
4	3.477695	Max_Step_Frequency
0	3.370108	Distance
6	3.295303	Max_Pace
7	2.636315	Total_Ascent
8	2.389099	Min_Temperature
10	2.297693	Max_Temperature
2	2.096246	Max_Heart_Rate

In [26]:

# GLM with features excluded based on correlation coefficient and VIF
model_VIF_reduced = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                           exog=running_data_std.drop(['Aerobic_training_effect', 'Calories', 'Time', 'Avg_Step_Length', 'Total_Descent'], axis=1),
                           family=sm.api.families.Poisson())
results_VIF_reduced_model = model_VIF_reduced.fit()
print(results_VIF_reduced_model.summary())
print(f'AIC: {results_VIF_reduced_model.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_VIF_reduced_model)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      328
Model Family:                      Poisson   Df Model:                           12
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -349.49
Date:                     Fri, 10 Mar 2023   Deviance:                       10178.
Time:                             22:07:45   Pearson chi2:                 3.40e+03
No. Iterations:                          7   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Distance               0.4158      0.072      5.753      0.000       0.274       0.557
Avg_heart_rate        -0.2148      0.113     -1.900      0.057      -0.436       0.007
Max_Heart_Rate         0.0916      0.060      1.537      0.124      -0.025       0.208
Avg_Step_Frequency    -0.1973      0.099     -1.997      0.046      -0.391      -0.004
Max_Step_Frequency     0.0355      0.080      0.446      0.655      -0.120       0.191
Avg_Pace              -0.8032      0.140     -5.737      0.000      -1.078      -0.529
Max_Pace              -0.2814      0.082     -3.415      0.001      -0.443      -0.120
Total_Ascent           0.0542      0.066      0.827      0.408      -0.074       0.183
Min_Temperature       -0.1412      0.074     -1.916      0.055      -0.286       0.003
No_of_Rounds          -0.0009      0.078     -0.011      0.991      -0.153       0.152
Max_Temperature        0.1108      0.069      1.616      0.106      -0.024       0.245
Min_Height            -0.0310      0.084     -0.368      0.713      -0.196       0.134
Max_Height             0.0537      0.082      0.656      0.512      -0.107       0.214
======================================================================================
AIC: 724.9866979871073
Consistent AIC: 769.0044268318585

The athlete’s domain knowledge and correlation considerations are used to construct four model hypotheses for predicting ATE:

The first model includes few parameters that are controllable for the athlete, namely, distance, duration, pace and average heart rate of the run.

The second model additionally uses the environmental factors maximum temperature and total ascent.

The third model contains all features for which the absolute correlation coefficient with respect to ATE is above 0.5. Calories are exempted as they are not measured but calculated based on other parameters, thus, exhibiting very high multicollinearity.

A fourth, minimal model only includes the average heart rates and running times.

Furthermore, GLMs including the first two, five, and all principal components were considered. They are eliminated of multicollinearity and, thus, expected to achieve highest predictive power.

In [27]:

# 1. model with few internal factors
model_1 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_data_std[['Distance', 'Time', 'Avg_heart_rate', 'Avg_Pace']],
                     family=sm.api.families.Poisson())
results_model_1 = model_1.fit()
print(results_model_1.summary())
print(f'AIC: {results_model_1.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_1)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      337
Model Family:                      Poisson   Df Model:                            3
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -354.19
Date:                     Fri, 10 Mar 2023   Deviance:                       10202.
Time:                             22:07:45   Pearson chi2:                 4.20e+03
No. Iterations:                          8   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Distance          -3.2018      0.481     -6.658      0.000      -4.144      -2.259
Time               3.8260      0.490      7.808      0.000       2.866       4.786
Avg_heart_rate     0.0428      0.090      0.478      0.633      -0.133       0.218
Avg_Pace          -1.0990      0.139     -7.898      0.000      -1.372      -0.826
==================================================================================
AIC: 716.3780627300014
Consistent AIC: 725.8824949411892

In [28]:

# 2. model additionally including environmental factors
model_2 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_data_std[['Distance', 'Time', 'Avg_heart_rate', 'Avg_Pace', 'Max_Temperature', 'Total_Ascent']],
                     family=sm.api.families.Poisson())
results_model_2 = model_2.fit()
print(results_model_2.summary())
print(f'AIC: {results_model_2.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_2)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      335
Model Family:                      Poisson   Df Model:                            5
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -353.65
Date:                     Fri, 10 Mar 2023   Deviance:                       10204.
Time:                             22:07:45   Pearson chi2:                 4.32e+03
No. Iterations:                          8   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Distance           -3.2517      0.481     -6.758      0.000      -4.195      -2.309
Time                3.8496      0.488      7.893      0.000       2.894       4.806
Avg_heart_rate      0.0283      0.091      0.312      0.755      -0.150       0.206
Avg_Pace           -1.1192      0.141     -7.965      0.000      -1.395      -0.844
Max_Temperature    -0.0106      0.042     -0.251      0.802      -0.093       0.072
Total_Ascent        0.0450      0.046      0.969      0.333      -0.046       0.136
===================================================================================
AIC: 719.2943070907802
Consistent AIC: 736.4683607760932

In [29]:

# 3. model including features with highest corr. coefficients
model_3 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_data_std[['Distance', 'Time', 'Avg_heart_rate', 'Max_Heart_Rate', 'Avg_Pace', 'Avg_Step_Length', 'Total_Ascent', 'No_of_Rounds']],
                     family=sm.api.families.Poisson())
results_model_3 = model_3.fit()
print(results_model_3.summary())
print(f'AIC: {results_model_3.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_3)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      333
Model Family:                      Poisson   Df Model:                            7
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -348.29
Date:                     Fri, 10 Mar 2023   Deviance:                       10216.
Time:                             22:07:45   Pearson chi2:                 5.78e+03
No. Iterations:                          7   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Distance           -3.2226      0.576     -5.594      0.000      -4.352      -2.094
Time                3.7084      0.596      6.227      0.000       2.541       4.876
Avg_heart_rate     -0.1350      0.115     -1.176      0.240      -0.360       0.090
Max_Heart_Rate      0.1566      0.055      2.854      0.004       0.049       0.264
Avg_Pace           -1.4705      0.252     -5.838      0.000      -1.964      -0.977
Avg_Step_Length    -0.2199      0.180     -1.222      0.222      -0.573       0.133
Total_Ascent        0.0348      0.047      0.734      0.463      -0.058       0.128
No_of_Rounds        0.1133      0.065      1.750      0.080      -0.014       0.240
===================================================================================
AIC: 712.5789957498855
Consistent AIC: 737.4226709093238

In [30]:

# 4. minimal model
model_4 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_data_std[['Avg_heart_rate', 'Time']],
                     family=sm.api.families.Poisson())
results_model_4 = model_4.fit()
print(results_model_4.summary())
print(f'AIC: {results_model_4.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_4)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      339
Model Family:                      Poisson   Df Model:                            1
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -385.65
Date:                     Fri, 10 Mar 2023   Deviance:                       10111.
Time:                             22:07:45   Pearson chi2:                 1.16e+03
No. Iterations:                          7   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Avg_heart_rate     0.4338      0.038     11.408      0.000       0.359       0.508
Time               0.5742      0.043     13.440      0.000       0.490       0.658
==================================================================================
AIC: 775.3049454834963
Consistent AIC: 777.139756220559

In [31]:

# 5. model: PC1, PC2
model_5 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_pcs[['PC_1', 'PC_2']],
                     family=sm.api.families.Poisson())
results_model_5 = model_5.fit()
print(results_model_5.summary())
print(f'AIC: {results_model_5.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_5)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      339
Model Family:                      Poisson   Df Model:                            1
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -374.08
Date:                     Fri, 10 Mar 2023   Deviance:                       10121.
Time:                             22:07:45   Pearson chi2:                 1.42e+03
No. Iterations:                          7   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
PC_1           0.3003      0.017     17.947      0.000       0.268       0.333
PC_2           0.0994      0.022      4.591      0.000       0.057       0.142
==============================================================================
AIC: 752.1610385773873
Consistent AIC: 753.99584931445

In [32]:

# 6. model: PC1, PC2, PC3, PC4, PC5
model_6 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_pcs[['PC_1', 'PC_2', 'PC_3', 'PC_4', 'PC_5']],
                     family=sm.api.families.Poisson())
results_model_6 = model_6.fit()
print(results_model_6.summary())
print(f'AIC: {results_model_6.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_6)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      336
Model Family:                      Poisson   Df Model:                            4
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -363.39
Date:                     Fri, 10 Mar 2023   Deviance:                       10130.
Time:                             22:07:45   Pearson chi2:                 1.61e+03
No. Iterations:                          7   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
PC_1           0.3235      0.018     18.248      0.000       0.289       0.358
PC_2           0.0945      0.020      4.747      0.000       0.055       0.134
PC_3          -0.1162      0.026     -4.412      0.000      -0.168      -0.065
PC_4           0.0323      0.032      1.018      0.309      -0.030       0.095
PC_5           0.0230      0.036      0.642      0.521      -0.047       0.093
==============================================================================
AIC: 736.7746597635276
Consistent AIC: 750.113902711778

In [33]:

# 7. model: all principal components
model_7 = sm.api.GLM(endog=running_data_std['Aerobic_training_effect'],
                     exog=running_pcs[[f'PC_{i+1}' for i in range(pca.n_components_)]],
                     family=sm.api.families.Poisson())
results_model_7 = model_7.fit()
print(results_model_7.summary())
print(f'AIC: {results_model_7.aic}')
print(f'Consistent AIC: {calc_consistent_aic(results_model_7)}')

                    Generalized Linear Model Regression Results                    
===================================================================================
Dep. Variable:     Aerobic_training_effect   No. Observations:                  341
Model:                                 GLM   Df Residuals:                      323
Model Family:                      Poisson   Df Model:                           17
Link Function:                         Log   Scale:                          1.0000
Method:                               IRLS   Log-Likelihood:                -298.49
Date:                     Fri, 10 Mar 2023   Deviance:                       10303.
Time:                             22:07:45   Pearson chi2:                 1.50e+05
No. Iterations:                          8   Pseudo R-squ. (CS):                nan
Covariance Type:                 nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
PC_1           0.4752      0.022     21.485      0.000       0.432       0.519
PC_2           0.1828      0.024      7.506      0.000       0.135       0.231
PC_3          -0.0976      0.026     -3.689      0.000      -0.149      -0.046
PC_4           0.0649      0.031      2.112      0.035       0.005       0.125
PC_5          -0.0137      0.034     -0.400      0.689      -0.081       0.054
PC_6          -0.2582      0.046     -5.653      0.000      -0.348      -0.169
PC_7          -0.3425      0.060     -5.667      0.000      -0.461      -0.224
PC_8           0.6934      0.091      7.602      0.000       0.515       0.872
PC_9          -0.0358      0.089     -0.403      0.687      -0.210       0.138
PC_10         -0.3612      0.098     -3.679      0.000      -0.554      -0.169
PC_11         -0.1475      0.095     -1.548      0.122      -0.334       0.039
PC_12         -0.6154      0.129     -4.776      0.000      -0.868      -0.363
PC_13         -0.1211      0.117     -1.033      0.302      -0.351       0.109
PC_14         -1.3104      0.180     -7.298      0.000      -1.662      -0.958
PC_15         -0.1532      0.162     -0.943      0.346      -0.471       0.165
PC_16         -1.8616      0.470     -3.961      0.000      -2.783      -0.940
PC_17         -5.5622      0.720     -7.730      0.000      -6.972      -4.152
PC_18          7.0885      1.153      6.150      0.000       4.829       9.348
==============================================================================
AIC: 632.9732512000604
Consistent AIC: 696.1650337301247

Based on the AIC, the full PC model achieves the best performance, followed by the full model, and hypothesis model containing the features with highest correlation coefficients with respect to ATE.

The good performance of the full model suggests that the AIC may not sufficiently punish the number of features. To overcome this limitation, Bozdogan (1987) proposes the Consistent AIC (CAIC) for model selection. CAIC “increases the ‘penalty term’” (p. 358), thus leading to generally lower dimensional models to avoid overfitting.

Based on the manually calculated CAIC, the full PC model achieves the best performance but, in contrast to the previous findings, the model with few, controllable factors outperforms the full model. Adding the environmental factors worsens the performance to the level of the full model, which also applies to the model containing only the features with the highest ATE-correlation coefficients. Still, the minimal model performs worst.

Thus, the model selection based on Poisson regression and CAIC suggests that the best ATE predictors are distance, time, average heart rate, and average pace.

Conclusion¶

To analyze the training data of a track athlete, this report uses exploratory data analysis, one-way-anova, pca, clustering and model selection for GLM. Based on the research questions, this analysis yields the following results:

Question 1: The majority of trainings was performed in the late afternoon. The monthly kilometers are consistent over the year, but dropped in September. The weekly distribution show increasing training loads paired with recovery weeks.
Question 2: The training types differ significantly in calories, average heart rate and ATE. Running activities exhibit the highest mean value but also the highest variations.
Question 3: Running activities could be divided into five different groups with respective characteristics and impact on the training load.
Question 4: The best hypothesis based on CAIC contains the features distance, time, average heart rate and average pace. However, the best performance is achieved by the full PC model.

Overall, the insights generated through this statistical analysis may serve to assist athletes in their race preparation.

We run together
Cardiac arrest times three
Performance is peak

Referencs and further readings¶

Inertia and silhouette score: https://towardsdatascience.com/silhouette-or-elbow-that-is-the-question-a1dda4fb974
GLM: https://towardsdatascience.com/generalized-linear-models-9cbf848bb8ab,
GLM: https://www.researchgate.net/publication/235726158
AIC and CAIC: Bozdogan, H. (1987). Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika 52, 345–370. https://doi.org/10.1007/BF02294361
Schwäbischer Haiku: https://www.topster.de/deutsch-schwaebisch/

Cookie	Dauer	Beschreibung
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Analyse von Trainingsdaten

Introduction¶

Preprocessing¶

Question 1: When were certain activities performed?¶

At which time of the day did the athlete train?¶

How many kilometers were run per month?¶

How many kilometers were run per week?¶

Question 2: How do training types differ in performance parameters?¶

Question 3: How can running activities be clustered?¶

Question 4: How is the aerobic training effect explained by other training parameters?¶

Full Model including all features¶

Reduced Model using stepwise reduction¶

Conclusion¶

Referencs and further readings¶

LG4ML

Navigation

Folge uns auf

Analyse von Trainingsdaten

Introduction¶

Preprocessing¶

Question 1: When were certain activities performed?¶

At which time of the day did the athlete train?¶

How many kilometers were run per month?¶

How many kilometers were run per week?¶

Question 2: How do training types differ in performance parameters?¶

Question 3: How can running activities be clustered?¶

Question 4: How is the aerobic training effect explained by other training parameters?¶

Full Model including all features¶

Reduced Model using stepwise reduction¶

Conclusion¶

Referencs and further readings¶

LG4ML

Navigation

Folge uns auf

Cookie-Hinweis