Steam video games analysis

14 minute read

Steam Video Games Analysis

Below are some quick results of an exploratory analysis of Steam Data Set available on Kaggle. The dataset contains a single table with four columns: user_id, game_title, behaviour_name, values. Data description:

“Steam is the world’s most popular PC Gaming hub. With a massive collection that includes everything from AAA blockbusters to small indie titles, great discovery tools can be super valuable for Steam. How can we make them better? This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are ‘purchase’ and ‘play’. The value indicates the degree to which the behavior was performed - in the case of ‘purchase’ the value is always 1, and in the case of ‘play’ the value represents the number of hours the user has played the game.”

Analysis

The analysis will be broken into 3 parts: cleaning the data, game analysis, and players behavior analysis. I will attempt to find the most and the least popular games, distributions of users and hours played, some patterns in players behavior.

Cleaning the data

First, let’s import the dataset and load some libraries:

steam.200k <- read.csv("~/Desktop/wargaming/analysis/steam-200k.csv", header=FALSE)
library(ggplot2)       #graphics
library(dplyr)
library(RColorBrewer)
library(knitr)
library(kableExtra)    #nicely formated html tables
library(plotly)        #graphics
library(waffle)        #waffle charts
library(psych) #to call describe function
library(scales) #percentage scale for quantile chart

Next, write the raw data into a new data frame df. It’s always a good practice to keep initial dataset intact. I might need to refer to it at later stages of the analysis and/or use it to verify that the data hasn’t been messed up in the process.

Anyways, let’s look inside the newly created df:

df<-steam.200k

#options:
#dim(df)            # dims
#str(df)            # data description
#summary(df)        #summary statistics

#view the top 5 rows of the data frame
knitr::kable(head(df, 5), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F)

V1	V2	V3	V4
151603712	The Elder Scrolls V Skyrim	purchase	1
151603712	The Elder Scrolls V Skyrim	play	273
151603712	Fallout 4	purchase	1
151603712	Fallout 4	play	87
151603712	Spore	purchase	1

Column V5 looks suspicious. Summary shows it indeed contains only 0s, thus is meaningless and can be removed.

summary(df$V5)
df<-df[,-5]

Rename all other columns

df<- df %>% rename(player_id = V1,
                   game = V2,
                   behavior = V3,
                   value = V4)
knitr::kable(head(df, 5), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F)

player_id	game	behavior	value
151603712	The Elder Scrolls V Skyrim	purchase	1
151603712	The Elder Scrolls V Skyrim	play	273
151603712	Fallout 4	purchase	1
151603712	Fallout 4	play	87
151603712	Spore	purchase	1

From the data description, I know that for any unique player_id, if the value of behavior ==“played”, then there will be two rows for the same game. If the game was purchased but not played, there will be only one row for each game with behavior == “purchased”. Let’s arrange this information in a more convenient form. Create two new columns played and purchased filled with 0s and 1s, respectively.

df1<-df%>% mutate(played =0, purchased =1L)

#rearrange df
df2<-df1 %>%
         group_by(player_id, game)%>% # group by unique id's and game
         mutate(group_count=n()) %>%  #create a temporary column with the group count. If the group count is 2, we'll delete one row (the came is both purchased and played) and put this info into respective columns. Otherwise we'll keep a single row (the game has been purchased but not played)
         ungroup%>%
         mutate(played=if_else(behavior=="play", value,0))%>% #fill in played column with the hours, if the game was played. otherwise the value will be 0.
         filter(behavior == "play" | group_count == 1) #filter the rows with the behavior == "play" or group_count==1 (which means purchased but not played)

df3<-df2[,-c(3,4,7)] # remove extra columns behavior, value, group_count

df_clean<-df3
knitr::kable(head(df_clean), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F) #take a look at the result

player_id	game	played	purchased
151603712	The Elder Scrolls V Skyrim	273.0	1
151603712	Fallout 4	87.0	1
151603712	Spore	14.9	1
151603712	Fallout New Vegas	12.1	1
151603712	Left 4 Dead 2	8.9	1
151603712	HuniePop	8.5	1

Game Analysis

#create a data frame with the game stats
df_game<-df_clean%>%
         group_by(game) %>%   #group by game names
         summarise(count=n(), # count the number of game purchases/players
                   total_hrs=sum(played), #count the total gaming time per game
                   ave_hrs=mean(played)) %>% #average gaming time (total time/count)
         mutate(total_years = round(total_hrs/(24*365), digits=2)) #translate total time into years

options(repr.plot.width=4, repr.plot.height=5)

Check some summary statistics for the game data

des <-describe(df_game)
print(des, digits = 2)

##             vars    n    mean       sd  median trimmed     mad min
## game*          1 5155 2578.00  1488.26 2578.00 2578.00 1911.07   1
## count          2 5155   24.91   101.35    5.00    9.93    5.93   1
## total_hrs      3 5155  668.35 14990.06    3.90   27.50    5.78   0
## ave_hrs        4 5155    6.78    29.43    0.78    2.14    1.15   0
## total_years    5 5155    0.08     1.71    0.00    0.00    0.00   0
##                   max     range  skew kurtosis     se
## game*         5155.00   5154.00  0.00    -1.20  20.73
## count         4841.00   4840.00 25.99  1066.14   1.41
## total_hrs   981684.60 981684.60 56.89  3602.33 208.78
## ave_hrs       1295.00   1295.00 21.04   758.01   0.41
## total_years    112.06    112.06 56.88  3601.79   0.02

sum<-rbind(summary(df_game$count),summary(df_game$total_hrs),summary(df_game$ave_hrs))
name =c('count','total_hrs','ave_hrs')
sum=cbind(name, sum)
sum

##      name        Min. 1st Qu. Median  Mean
## [1,] "count"     "1"  "2"     "5"     "24.9064985451018"
## [2,] "total_hrs" "0"  "0"     "3.9"   "668.354180407371"
## [3,] "ave_hrs"   "0"  "0"     "0.775" "6.7770375943326"
##      3rd Qu.            Max.
## [1,] "18"               "4841"
## [2,] "37.9"             "981684.6"
## [3,] "3.99010263929619" "1295"

What are the most popular games?

First of all, the answer would depend on we one define “popularity”. Let’s start by looking at the top 20 games by the number of purchases (users) and the top 20 by total time played.

options(repr.plot.width=4, repr.plot.height=5)
plot1<- df_game %>% arrange(desc(count)) %>% slice(1:20)%>%
        ggplot(aes(x=reorder(game, count), y=count, fill=count))+geom_bar(stat = "identity")+
        geom_text(aes(label=count), hjust= -0.5, size=2, position = position_dodge(width = 0.5),inherit.aes = TRUE)+
        scale_fill_gradient2()+
        coord_flip(ylim=c(0,5500))+labs(x = '', y = 'users', title = 'Top 20 most purchased games')

plot2<- df_game %>% arrange(desc(total_hrs)) %>% slice(1:20)%>%
        ggplot(aes(x=reorder(game, total_hrs), y=total_years, fill=total_years))+geom_bar(stat = "identity")+
        geom_text(aes(label=total_years), hjust= -0.5, size=2, position = position_dodge(width = 0.5),inherit.aes = TRUE)+
        scale_color_grey()+
        coord_flip(ylim=c(0,140))+labs(x = '', y = 'years', title = 'Top 20 games by years played')

plot1

plot2

Okay. It seems, there are several games that appear on both charts. Let’s find out which are these games exactly. For doing this I use left_join() command, create a new table with the resulting values, and add columns genre and mode (with the information manually collected from Wikipedia)

#find most popular games by both criteria
userbased<-df_game %>% arrange(desc(count)) %>% slice(1:20)
hourbased<- df_game %>% arrange(desc(total_hrs)) %>% slice(1:20)

popular<-left_join(userbased, hourbased, by = 'game')%>%na.omit()

genre<-c('MOBA', '1st perston shooter','1st perston shooter','1st perston shooter','1st perston shooter','1st perston shooter', '3rd perston shooter', 'sandbox', 'action role playing', 'strategy')
mode<-c('both','both', 'multiplayer', 'multiplayer', 'both', 'multiplayer','both','both','single player', 'both')
popular2<-cbind(popular, genre, mode)

game	count	total_hrs	ave_hrs	total_years	genre	mode
Dota 2	4841	981684.6	202.78550	112.06	MOBA	both
Team Fortress 2	2323	173673.3	74.76251	19.83	1st perston shooter	both
Counter-Strike Global Offensive	1412	322771.6	228.59178	36.85	1st perston shooter	multiplayer
Counter-Strike Source	978	96075.5	98.23671	10.97	1st perston shooter	multiplayer
Left 4 Dead 2	951	33596.7	35.32776	3.84	1st perston shooter	both
Counter-Strike	856	134261.1	156.84708	15.33	1st perston shooter	multiplayer
Warframe	847	27074.6	31.96529	3.09	3rd perston shooter	both
Garry’s Mod	731	49725.3	68.02367	5.68	sandbox	both
The Elder Scrolls V Skyrim	717	70889.3	98.86932	8.09	action role playing	single player
Sid Meier’s Civilization V	596	99821.3	167.48540	11.40	strategy	both

Now, let’s see how the most popular games (in terms of number of players) are distributed by genre and mode. Instead of conventional pie charts, I’ll draw waffles :)

Shooters and MOBA genres together amount to over 80% of players of the top 10 games. And the popular games attracting the most users are those with both (multi- and single-player) mode.

waffle2<-popular2%>%group_by(genre)%>%mutate(count_mode=sum(count))%>%select(c("genre", "count_mode"))%>%unique()
waffle2$percent<-waffle2$count_mode/sum(waffle2$count_mode)*100

waffle1<-popular2%>%group_by(mode)%>%mutate(count_mode=sum(count))%>%select(c("mode", "count_mode"))%>%unique()
waffle1$percent<-waffle1$count_mode/sum(waffle1$count_mode)*100

iron(
  waffle(c(`MOBA (34%)` =34, `1st person shooter (45.8%)`=45.8, `3rd person shooter(5.9%) `= 5.9, `sandbox (5.1%)` =5.1, `action role playing (5%)` =5, `strategy (4.2%)` =4.2), rows = 6, title ="Number of players in 10 most popular games by genre"),

  waffle(c(`both (72.2%)` =72.2, `multiplayer (22.8%)`=22.8, `singleplayer (5%) `= 5), rows = 6, title ="Number of players in 10 most popular games by mode")
)

Ready to explore the least popular games?

Firstly, it’s interesting to see how many games were purchased and never played.

notpl<-df_game%>%filter(df_game$total_hrs == 0)
dim(notpl)

## [1] 1555    5

summary(notpl)

##                                game          count           total_hrs
##  16 Bit Arena                    :   1   Min.   :  1.000   Min.   :0
##  1701 A.D. Gold Edition          :   1   1st Qu.:  1.000   1st Qu.:0
##  1953 - KGB Unleashed            :   1   Median :  2.000   Median :0
##  1954 Alcatraz                   :   1   Mean   :  7.273   Mean   :0
##  1Heart                          :   1   3rd Qu.:  5.000   3rd Qu.:0
##  3DMark API Overhead feature test:   1   Max.   :378.000   Max.   :0
##  (Other)                         :1549
##     ave_hrs   total_years
##  Min.   :0   Min.   :0
##  1st Qu.:0   1st Qu.:0
##  Median :0   Median :0
##  Mean   :0   Mean   :0
##  3rd Qu.:0   3rd Qu.:0
##  Max.   :0   Max.   :0
##

Wow, the number of the “never played” games is 1555, or 30% of the total number of games in our dataset! The median number of purchases for the “never played” games is only 2, the maximum is 378. Unfortunately, we cannot really tell the new/just issued games from the unpopular once, since there’s no time indicators whatsoever.

Let’s find out which games have been purchased the maximum number of times, but never played.

plot3<- df_game %>% filter(df_game$total_hrs == 0) %>% arrange(desc(count)) %>% slice(1:20)%>%
        ggplot(aes(x=reorder(game, count), y=count, fill=count))+geom_bar(stat = "identity")+
        geom_text(aes(label=count), hjust= -0.5, size=2, position = position_dodge(width = 0.5),inherit.aes = TRUE)+
         scale_fill_gradient2(low = "white", mid = "grey", high = "black")+
        coord_flip(ylim=c(0,450))+labs(x = '', y = 'number of purchases', title = 'Top 20  purchased & never played games')
plot3

Quick search shows that some of these games are proper games, while others are add-ons/ expansion packs/patches. Clearly, this means our data-set is non homogeneous. The presence of add-ons explains high number of 0s in hours played column, which probably will cause certain problems with distributions later on. Would be nice to have a proper classification variable for the games for deeper analysis.

Distributions

Based on the results above, we filter out the values with number of hours == 0. The distribution of total and average number of hours played are close to log-normal. The number of purchases by game doesn’t get “normal” after taking logs. It looks like some power law distribution (check a histogram in levels). Power law distribution is used for modeling internet sites with x users, for example. Looks like the idea can be applied to games. One can further investigate the issue with KS-test. There’s large mass concentrated around 0 (very small number of purchases), can be caused by the above mentioned add-ons issue, and the fact that we don’t know any time variables (some games might have just aired).

A very nice reference on Random Number Distributions

#histogram of players  distribution
d1<-df_game%>%ggplot(aes(x=count))+geom_histogram(bins=50)+labs(title="Players by game distribution")
d11<-df_game%>%ggplot(aes(x=log10(count)))+geom_histogram(bins=50)+labs(title="Players by game distribution")
d12<-df_game%>%ggplot(aes(x=log10(count)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+labs(title="count by game distribution")+geom_density(alpha=.2, fill="#0040FF")
d13<-df_game%>%filter(total_hrs!=0)%>%ggplot(aes(x=log10(count)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+labs(title="Count by game distribution")+geom_density(alpha=.2, fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(count), na.rm=T)), color="red", linetype="dashed", size=1)


#histogram of tot_hours  distribution
d2<-df_game%>%ggplot(aes(x=total_hrs))+geom_histogram(bins=50)+labs(title="total hours by game distribution")
d21<-df_game%>%ggplot(aes(x=log10(total_hrs)))+geom_histogram(bins=25)+labs(title="total hours by game distribution") #r removes entries with 0 hours played
d22<-df_game%>%filter(total_hrs!=0)%>%ggplot(aes(x=log10(total_hrs)))+
        geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
        labs(title="total hours by game distribution")+geom_density(alpha=.2, fill="#0040FF")+
        geom_vline(aes(xintercept=mean(log10(total_hrs), na.rm=T)), color="red", linetype="dashed", size=1)


#histogram of ave_hours  distribution
d3<-df_game%>%ggplot(aes(x=ave_hrs))+geom_histogram(bins=50)+labs(title="average hours by game distribution")
d31<-df_game%>%ggplot(aes(x=log10(ave_hrs)))+geom_histogram(bins=25)+labs(title="average hours by game distribution") #r removes entries with 0 hours played
d32<-df_game%>%filter(ave_hrs!=0)%>%ggplot(aes(x=log10(ave_hrs)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
        labs(title="average hours by game distribution")+geom_density(alpha=.2, fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(ave_hrs), na.rm=T)), color="red", linetype="dashed", size=1)
iron(d13, d22, d32)

The next graph is a very good illustration of 20/80 rule (Pareto principle): top 20% of games accumulate 80% of the players.

q1 <- df_game %>%
        mutate(rank = min_rank(count)) %>%
        arrange(rank) %>%
        mutate(index = row_number(),
               game_quantile = index/n(),
               total_player_quantile = cumsum(count)/sum(count)) %>%
        ggplot(aes(x = game_quantile, y = total_player_quantile)) +
        geom_area(fill = "skyblue3") +
        scale_x_continuous(labels = percent) +
        scale_y_continuous(labels = percent) +
        theme_light() +
        labs(title = "players/game percentage accumulation graph")+ geom_hline(aes(yintercept=.2), color="black", linetype="dashed", size=.5)+ geom_vline(aes(xintercept=.8), color="black", linetype="dashed", size=.5)

q2 <- df_game %>%
        mutate(rank = min_rank(total_hrs)) %>%
        arrange(rank) %>%
        mutate(index = row_number(),
               game_quantile = index/n(),
               total_hrs_quantile = cumsum(total_hrs)/sum(total_hrs)) %>%
        ggplot(aes(x = game_quantile, y = total_hrs_quantile)) +
        geom_area(fill = "skyblue3") +
        scale_x_continuous(labels = percent) +
        scale_y_continuous(labels = percent) +
        theme_light() +
        labs(title = "total hrs/game percentage accumulation graph")

q3 <- df_game %>%
        mutate(rank = min_rank(ave_hrs)) %>%
        arrange(rank) %>%
        mutate(index = row_number(),
               game_quantile = index/n(),
              average_hrs_quantile = cumsum(ave_hrs)/sum(ave_hrs)) %>%
        ggplot(aes(x = game_quantile, y = average_hrs_quantile)) +
        geom_area(fill = "skyblue3") +
        scale_x_continuous(labels = percent) +
        scale_y_continuous(labels = percent) +
        theme_light() +
        labs(title = "average hrs/game percentage accumulation graph")

iron(q1, q2, q3)

Behavior Analysis

#Begin player's analysis by creating a new dataframe with unique players' IDs in the 1st column.
df_play<-df_clean%>%
         group_by(player_id) %>%   #group by player ID
         summarise(count=n(), # number of games per player
                   total_hrs=sum(played), #общее игровое время игрока
                   ave_hrs=mean(played)) %>% #среднее игровое время игрока
         mutate(total_years = round(total_hrs/(24*365), digits=2))

df_play1<-df_play%>%arrange(desc(count))
knitr::kable(head(df_play1, 10), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F)

player_id	count	total_hrs	ave_hrs	total_years
62990992	1061	5724.8	5.395664	0.65
33865373	774	2274.4	2.938501	0.26
30246419	752	1219.7	1.621941	0.14
58345543	665	1374.9	2.067519	0.16
76892907	597	2731.3	4.575042	0.31
20772968	590	1395.0	2.364407	0.16
11403772	584	6009.9	10.290925	0.69
64787956	579	1222.3	2.111054	0.14
22301321	562	1460.8	2.599288	0.17
47457723	554	4844.3	8.744224	0.55

A player has on average 10 games purchased, plays on average 278 hrs in total, and 55 hrs per game. The data looks over dispersed. Negative binomial distribution could be one of the popular options for modelling count data such as games by player (it’s derived from the Poisson when the mean itself distributed with gamma distribution. This allows for heterogeneity in a population, and is thus for over dispersion). Again - KS test should give more hints. The distribution of hours per player look roughly log-normal.

des <-describe(df_play)
print(des, digits = 2)

##             vars     n        mean          sd      median      trimmed
## player_id      1 12392 1.61126e+08 80007568.18 164933610.5 161266735.53
## count          2 12392 1.03600e+01       35.48         2.0         3.60
## total_hrs      3 12392 2.78030e+02      760.30        13.5        92.91
## ave_hrs        4 12392 5.54900e+01      241.31         5.7        15.34
## total_years    5 12392 3.00000e-02        0.09         0.0         0.01
##                     mad  min          max        range  skew kurtosis
## player_id   92918954.22 5250 309903146.00 309897896.00 -0.05    -0.92
## count              1.48    1      1061.00      1060.00 10.74   175.55
## total_hrs         20.02    0     11754.00     11754.00  5.50    43.17
## ave_hrs            8.25    0     11754.00     11754.00 17.81   581.98
## total_years        0.00    0         1.34         1.34  5.48    43.03
##                    se
## player_id   718721.06
## count            0.32
## total_hrs        6.83
## ave_hrs          2.17
## total_years      0.00

How many “heavy gamers” there’re among players?

Define “heavy gamers” as follows:

total hours > mean + st.dev. = 278+760.3=1038.3
average hours > mean + st.dev. = 55.4+241 = 296.4
number of games > mean + st.dev = 10.36+35.46 = 45.8

“Heavy players”

hg1<-nrow(df_play%>%filter(total_hrs>1040))
hg2<-nrow(df_play%>%filter(ave_hrs>300))
hg3<-nrow(df_play%>%filter(count>46))

cat("by total hours played:", hg1)

## by total hours played: 986

cat('by average hours played per game:', hg2)

## by average hours played per game: 483

cat('by number of games:', hg3)

## by number of games: 565

The most “heavy” gamer

Someone has an average playing time per game = total time = 11754.00. Find ID and which game he/she plays, and the game he plays:

bestpl<-df_play[df_play$total_hrs==max(df_play$total_hrs), "player_id"]
bestpl

## # A tibble: 1 x 1
##   player_id
##       <int>
## 1  73017395

bestgame<-df_clean[df_clean$player_id==73017395, "game"]
bestgame

## # A tibble: 1 x 1
##   game
##   <fct>
## 1 Sid Meier's Civilization V

How many players have 0 playing time?

zt<-nrow(df_play%>%filter(total_hrs==0))
cat("players with zero total hours played:", zt)

## players with zero total hours played: 1042

1042 players or 8.4% of unique id’s have 0 playing times.

Players distribution

#histogram of players  distribution

pd12<-df_game%>%ggplot(aes(x=log10(count)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+labs(title="games by players distribution") +geom_density(alpha=.2,fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(count), na.rm=T)), color="red", linetype="dashed", size=1)


#histogram of tot_hours  distribution
pd22<-df_play%>%filter(total_hrs!=0)%>%ggplot(aes(x=log10(total_hrs)))+
        geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
        labs(title="total hours by player distribution")+geom_density(alpha=.2, fill="#0040FF")+
        geom_vline(aes(xintercept=mean(log10(total_hrs), na.rm=T)), color="red", linetype="dashed", size=1)


#histogram of ave_hours  distribution
pd32<-df_play%>%filter(ave_hrs!=0)%>%ggplot(aes(x=log10(ave_hrs)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
        labs(title="average hours by player distribution")+geom_density(alpha=.2, fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(ave_hrs), na.rm=T)), color="red", linetype="dashed", size=1)
iron(pd12, pd22, pd32)

Summary of the findings:

Part I. Games analysis:

1. There’s a small number of very popular games. A vast number of games in the dataset (30%) have 0 playing time. However, it’s not possible to distinguish between genuinely unpopular games, new games, and various game add-ons/updates.

2. The distribution of players by game looks power law (possibly Pareto). The data on total number of hours played by game is extremely over-dispersed.

3. Among the top 10 games, the most popular genre is shooter, mode - multi+single player. 4. The left tail data on playing time is over inflated with 0s due to the fact that There are add-ons present in the dataset, and one cannot distinguish between the add-ons and “unpopular” games. Should be careful when modeling such data, it’s better to complete the dataset with categorical variable on type of games.

Part II. Players’ behavior:

1. Very few users (8%) are engaged with games for high number of hours. Very few players (approx. 500 out of 12000 are “heavy gamers”). 1042 players or 8.4% of unique id’s have 0 playing times. Median users spend 5.7 hrs per game. Not clear if this can be attributed to user’s behavior or games features (duration of games, duration of the easiest levels).

2. Hours played look like log-normally distributed data. Negative binomial distribution - can be a start for modeling the number of games per players (taking into account some over dispersion in the data).

Share on

Twitter Facebook Google+ LinkedIn

Anna Belova

Steam video games analysis

Analysis

Cleaning the data

Game Analysis

What are the most popular games?

Ready to explore the least popular games?

Distributions

Behavior Analysis

How many “heavy gamers” there’re among players?

The most “heavy” gamer

How many players have 0 playing time?

Players distribution

Share on

Leave a Comment

You May Also Enjoy

Web scraping

Working with databases in R: dplyr

Linking to a Pdf in Jekyll

Hello, World!