Steam video games analysis
Below are some quick results of an exploratory analysis of Steam Data Set available on Kaggle. The dataset contains a single table with four columns: user_id, game_title, behaviour_name, values. Data description:
“Steam is the world’s most popular PC Gaming hub. With a massive collection that includes everything from AAA blockbusters to small indie titles, great discovery tools can be super valuable for Steam. How can we make them better? This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are ‘purchase’ and ‘play’. The value indicates the degree to which the behavior was performed - in the case of ‘purchase’ the value is always 1, and in the case of ‘play’ the value represents the number of hours the user has played the game.”

Analysis
The analysis will be broken into 3 parts: cleaning the data, game analysis, and players behavior analysis. I will attempt to find the most and the least popular games, distributions of users and hours played, some patterns in players behavior.
Cleaning the data
First, let’s import the dataset and load some libraries:
steam.200k <- read.csv("~/Desktop/wargaming/analysis/steam-200k.csv", header=FALSE)
library(ggplot2) #graphics
library(dplyr)
library(RColorBrewer)
library(knitr)
library(kableExtra) #nicely formated html tables
library(plotly) #graphics
library(waffle) #waffle charts
library(psych) #to call describe function
library(scales) #percentage scale for quantile chart
Next, write the raw data into a new data frame df. It’s always a good practice to keep initial dataset intact. I might need to refer to it at later stages of the analysis and/or use it to verify that the data hasn’t been messed up in the process.
Anyways, let’s look inside the newly created df:
df<-steam.200k
#options:
#dim(df) # dims
#str(df) # data description
#summary(df) #summary statistics
#view the top 5 rows of the data frame
knitr::kable(head(df, 5), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F)
V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|
151603712 | The Elder Scrolls V Skyrim | purchase | 1 | 0 |
151603712 | The Elder Scrolls V Skyrim | play | 273 | 0 |
151603712 | Fallout 4 | purchase | 1 | 0 |
151603712 | Fallout 4 | play | 87 | 0 |
151603712 | Spore | purchase | 1 | 0 |
Column V5 looks suspicious. Summary shows it indeed contains only 0s, thus is meaningless and can be removed.
summary(df$V5)
df<-df[,-5]
Rename all other columns
df<- df %>% rename(player_id = V1,
game = V2,
behavior = V3,
value = V4)
knitr::kable(head(df, 5), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F)
player_id | game | behavior | value |
---|---|---|---|
151603712 | The Elder Scrolls V Skyrim | purchase | 1 |
151603712 | The Elder Scrolls V Skyrim | play | 273 |
151603712 | Fallout 4 | purchase | 1 |
151603712 | Fallout 4 | play | 87 |
151603712 | Spore | purchase | 1 |
From the data description, I know that for any unique player_id, if the value of behavior ==“played”, then there will be two rows for the same game. If the game was purchased but not played, there will be only one row for each game with behavior == “purchased”. Let’s arrange this information in a more convenient form. Create two new columns played and purchased filled with 0s and 1s, respectively.
df1<-df%>% mutate(played =0, purchased =1L)
#rearrange df
df2<-df1 %>%
group_by(player_id, game)%>% # group by unique id's and game
mutate(group_count=n()) %>% #create a temporary column with the group count. If the group count is 2, we'll delete one row (the came is both purchased and played) and put this info into respective columns. Otherwise we'll keep a single row (the game has been purchased but not played)
ungroup%>%
mutate(played=if_else(behavior=="play", value,0))%>% #fill in played column with the hours, if the game was played. otherwise the value will be 0.
filter(behavior == "play" | group_count == 1) #filter the rows with the behavior == "play" or group_count==1 (which means purchased but not played)
df3<-df2[,-c(3,4,7)] # remove extra columns behavior, value, group_count
df_clean<-df3
knitr::kable(head(df_clean), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F) #take a look at the result
player_id | game | played | purchased |
---|---|---|---|
151603712 | The Elder Scrolls V Skyrim | 273.0 | 1 |
151603712 | Fallout 4 | 87.0 | 1 |
151603712 | Spore | 14.9 | 1 |
151603712 | Fallout New Vegas | 12.1 | 1 |
151603712 | Left 4 Dead 2 | 8.9 | 1 |
151603712 | HuniePop | 8.5 | 1 |
Game Analysis
#create a data frame with the game stats
df_game<-df_clean%>%
group_by(game) %>% #group by game names
summarise(count=n(), # count the number of game purchases/players
total_hrs=sum(played), #count the total gaming time per game
ave_hrs=mean(played)) %>% #average gaming time (total time/count)
mutate(total_years = round(total_hrs/(24*365), digits=2)) #translate total time into years
options(repr.plot.width=4, repr.plot.height=5)
Check some summary statistics for the game data
des <-describe(df_game)
print(des, digits = 2)
## vars n mean sd median trimmed mad min
## game* 1 5155 2578.00 1488.26 2578.00 2578.00 1911.07 1
## count 2 5155 24.91 101.35 5.00 9.93 5.93 1
## total_hrs 3 5155 668.35 14990.06 3.90 27.50 5.78 0
## ave_hrs 4 5155 6.78 29.43 0.78 2.14 1.15 0
## total_years 5 5155 0.08 1.71 0.00 0.00 0.00 0
## max range skew kurtosis se
## game* 5155.00 5154.00 0.00 -1.20 20.73
## count 4841.00 4840.00 25.99 1066.14 1.41
## total_hrs 981684.60 981684.60 56.89 3602.33 208.78
## ave_hrs 1295.00 1295.00 21.04 758.01 0.41
## total_years 112.06 112.06 56.88 3601.79 0.02
sum<-rbind(summary(df_game$count),summary(df_game$total_hrs),summary(df_game$ave_hrs))
name =c('count','total_hrs','ave_hrs')
sum=cbind(name, sum)
sum
## name Min. 1st Qu. Median Mean
## [1,] "count" "1" "2" "5" "24.9064985451018"
## [2,] "total_hrs" "0" "0" "3.9" "668.354180407371"
## [3,] "ave_hrs" "0" "0" "0.775" "6.7770375943326"
## 3rd Qu. Max.
## [1,] "18" "4841"
## [2,] "37.9" "981684.6"
## [3,] "3.99010263929619" "1295"
What are the most popular games?
First of all, the answer would depend on we one define “popularity”. Let’s start by looking at the top 20 games by the number of purchases (users) and the top 20 by total time played.
options(repr.plot.width=4, repr.plot.height=5)
plot1<- df_game %>% arrange(desc(count)) %>% slice(1:20)%>%
ggplot(aes(x=reorder(game, count), y=count, fill=count))+geom_bar(stat = "identity")+
geom_text(aes(label=count), hjust= -0.5, size=2, position = position_dodge(width = 0.5),inherit.aes = TRUE)+
scale_fill_gradient2()+
coord_flip(ylim=c(0,5500))+labs(x = '', y = 'users', title = 'Top 20 most purchased games')
plot2<- df_game %>% arrange(desc(total_hrs)) %>% slice(1:20)%>%
ggplot(aes(x=reorder(game, total_hrs), y=total_years, fill=total_years))+geom_bar(stat = "identity")+
geom_text(aes(label=total_years), hjust= -0.5, size=2, position = position_dodge(width = 0.5),inherit.aes = TRUE)+
scale_color_grey()+
coord_flip(ylim=c(0,140))+labs(x = '', y = 'years', title = 'Top 20 games by years played')
plot1
plot2
Okay. It seems, there are several games that appear on both charts. Let’s find out which are these games exactly. For doing this I use left_join() command, create a new table with the resulting values, and add columns genre and mode (with the information manually collected from Wikipedia)
#find most popular games by both criteria
userbased<-df_game %>% arrange(desc(count)) %>% slice(1:20)
hourbased<- df_game %>% arrange(desc(total_hrs)) %>% slice(1:20)
popular<-left_join(userbased, hourbased, by = 'game')%>%na.omit()
genre<-c('MOBA', '1st perston shooter','1st perston shooter','1st perston shooter','1st perston shooter','1st perston shooter', '3rd perston shooter', 'sandbox', 'action role playing', 'strategy')
mode<-c('both','both', 'multiplayer', 'multiplayer', 'both', 'multiplayer','both','both','single player', 'both')
popular2<-cbind(popular, genre, mode)
game | count | total_hrs | ave_hrs | total_years | genre | mode |
---|---|---|---|---|---|---|
Dota 2 | 4841 | 981684.6 | 202.78550 | 112.06 | MOBA | both |
Team Fortress 2 | 2323 | 173673.3 | 74.76251 | 19.83 | 1st perston shooter | both |
Counter-Strike Global Offensive | 1412 | 322771.6 | 228.59178 | 36.85 | 1st perston shooter | multiplayer |
Counter-Strike Source | 978 | 96075.5 | 98.23671 | 10.97 | 1st perston shooter | multiplayer |
Left 4 Dead 2 | 951 | 33596.7 | 35.32776 | 3.84 | 1st perston shooter | both |
Counter-Strike | 856 | 134261.1 | 156.84708 | 15.33 | 1st perston shooter | multiplayer |
Warframe | 847 | 27074.6 | 31.96529 | 3.09 | 3rd perston shooter | both |
Garry’s Mod | 731 | 49725.3 | 68.02367 | 5.68 | sandbox | both |
The Elder Scrolls V Skyrim | 717 | 70889.3 | 98.86932 | 8.09 | action role playing | single player |
Sid Meier’s Civilization V | 596 | 99821.3 | 167.48540 | 11.40 | strategy | both |
Now, let’s see how the most popular games (in terms of number of players) are distributed by genre and mode. Instead of conventional pie charts, I’ll draw waffles :)
Shooters and MOBA genres together amount to over 80% of players of the top 10 games. And the popular games attracting the most users are those with both (multi- and single-player) mode.
waffle2<-popular2%>%group_by(genre)%>%mutate(count_mode=sum(count))%>%select(c("genre", "count_mode"))%>%unique()
waffle2$percent<-waffle2$count_mode/sum(waffle2$count_mode)*100
waffle1<-popular2%>%group_by(mode)%>%mutate(count_mode=sum(count))%>%select(c("mode", "count_mode"))%>%unique()
waffle1$percent<-waffle1$count_mode/sum(waffle1$count_mode)*100
iron(
waffle(c(`MOBA (34%)` =34, `1st person shooter (45.8%)`=45.8, `3rd person shooter(5.9%) `= 5.9, `sandbox (5.1%)` =5.1, `action role playing (5%)` =5, `strategy (4.2%)` =4.2), rows = 6, title ="Number of players in 10 most popular games by genre"),
waffle(c(`both (72.2%)` =72.2, `multiplayer (22.8%)`=22.8, `singleplayer (5%) `= 5), rows = 6, title ="Number of players in 10 most popular games by mode")
)
Ready to explore the least popular games?
Firstly, it’s interesting to see how many games were purchased and never played.
notpl<-df_game%>%filter(df_game$total_hrs == 0)
dim(notpl)
## [1] 1555 5
summary(notpl)
## game count total_hrs
## 16 Bit Arena : 1 Min. : 1.000 Min. :0
## 1701 A.D. Gold Edition : 1 1st Qu.: 1.000 1st Qu.:0
## 1953 - KGB Unleashed : 1 Median : 2.000 Median :0
## 1954 Alcatraz : 1 Mean : 7.273 Mean :0
## 1Heart : 1 3rd Qu.: 5.000 3rd Qu.:0
## 3DMark API Overhead feature test: 1 Max. :378.000 Max. :0
## (Other) :1549
## ave_hrs total_years
## Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0
## Median :0 Median :0
## Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0
##
Wow, the number of the “never played” games is 1555, or 30% of the total number of games in our dataset! The median number of purchases for the “never played” games is only 2, the maximum is 378. Unfortunately, we cannot really tell the new/just issued games from the unpopular once, since there’s no time indicators whatsoever.
Let’s find out which games have been purchased the maximum number of times, but never played.
plot3<- df_game %>% filter(df_game$total_hrs == 0) %>% arrange(desc(count)) %>% slice(1:20)%>%
ggplot(aes(x=reorder(game, count), y=count, fill=count))+geom_bar(stat = "identity")+
geom_text(aes(label=count), hjust= -0.5, size=2, position = position_dodge(width = 0.5),inherit.aes = TRUE)+
scale_fill_gradient2(low = "white", mid = "grey", high = "black")+
coord_flip(ylim=c(0,450))+labs(x = '', y = 'number of purchases', title = 'Top 20 purchased & never played games')
plot3
Quick search shows that some of these games are proper games, while others are add-ons/ expansion packs/patches. Clearly, this means our data-set is non homogeneous. The presence of add-ons explains high number of 0s in hours played column, which probably will cause certain problems with distributions later on. Would be nice to have a proper classification variable for the games for deeper analysis.
Distributions
Based on the results above, we filter out the values with number of hours == 0. The distribution of total and average number of hours played are close to log-normal. The number of purchases by game doesn’t get “normal” after taking logs. It looks like some power law distribution (check a histogram in levels). Power law distribution is used for modeling internet sites with x users, for example. Looks like the idea can be applied to games. One can further investigate the issue with KS-test. There’s large mass concentrated around 0 (very small number of purchases), can be caused by the above mentioned add-ons issue, and the fact that we don’t know any time variables (some games might have just aired).
A very nice reference on Random Number Distributions
#histogram of players distribution
d1<-df_game%>%ggplot(aes(x=count))+geom_histogram(bins=50)+labs(title="Players by game distribution")
d11<-df_game%>%ggplot(aes(x=log10(count)))+geom_histogram(bins=50)+labs(title="Players by game distribution")
d12<-df_game%>%ggplot(aes(x=log10(count)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+labs(title="count by game distribution")+geom_density(alpha=.2, fill="#0040FF")
d13<-df_game%>%filter(total_hrs!=0)%>%ggplot(aes(x=log10(count)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+labs(title="Count by game distribution")+geom_density(alpha=.2, fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(count), na.rm=T)), color="red", linetype="dashed", size=1)
#histogram of tot_hours distribution
d2<-df_game%>%ggplot(aes(x=total_hrs))+geom_histogram(bins=50)+labs(title="total hours by game distribution")
d21<-df_game%>%ggplot(aes(x=log10(total_hrs)))+geom_histogram(bins=25)+labs(title="total hours by game distribution") #r removes entries with 0 hours played
d22<-df_game%>%filter(total_hrs!=0)%>%ggplot(aes(x=log10(total_hrs)))+
geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
labs(title="total hours by game distribution")+geom_density(alpha=.2, fill="#0040FF")+
geom_vline(aes(xintercept=mean(log10(total_hrs), na.rm=T)), color="red", linetype="dashed", size=1)
#histogram of ave_hours distribution
d3<-df_game%>%ggplot(aes(x=ave_hrs))+geom_histogram(bins=50)+labs(title="average hours by game distribution")
d31<-df_game%>%ggplot(aes(x=log10(ave_hrs)))+geom_histogram(bins=25)+labs(title="average hours by game distribution") #r removes entries with 0 hours played
d32<-df_game%>%filter(ave_hrs!=0)%>%ggplot(aes(x=log10(ave_hrs)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
labs(title="average hours by game distribution")+geom_density(alpha=.2, fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(ave_hrs), na.rm=T)), color="red", linetype="dashed", size=1)
iron(d13, d22, d32)
The next graph is a very good illustration of 20/80 rule (Pareto principle): top 20% of games accumulate 80% of the players.
q1 <- df_game %>%
mutate(rank = min_rank(count)) %>%
arrange(rank) %>%
mutate(index = row_number(),
game_quantile = index/n(),
total_player_quantile = cumsum(count)/sum(count)) %>%
ggplot(aes(x = game_quantile, y = total_player_quantile)) +
geom_area(fill = "skyblue3") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme_light() +
labs(title = "players/game percentage accumulation graph")+ geom_hline(aes(yintercept=.2), color="black", linetype="dashed", size=.5)+ geom_vline(aes(xintercept=.8), color="black", linetype="dashed", size=.5)
q2 <- df_game %>%
mutate(rank = min_rank(total_hrs)) %>%
arrange(rank) %>%
mutate(index = row_number(),
game_quantile = index/n(),
total_hrs_quantile = cumsum(total_hrs)/sum(total_hrs)) %>%
ggplot(aes(x = game_quantile, y = total_hrs_quantile)) +
geom_area(fill = "skyblue3") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme_light() +
labs(title = "total hrs/game percentage accumulation graph")
q3 <- df_game %>%
mutate(rank = min_rank(ave_hrs)) %>%
arrange(rank) %>%
mutate(index = row_number(),
game_quantile = index/n(),
average_hrs_quantile = cumsum(ave_hrs)/sum(ave_hrs)) %>%
ggplot(aes(x = game_quantile, y = average_hrs_quantile)) +
geom_area(fill = "skyblue3") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme_light() +
labs(title = "average hrs/game percentage accumulation graph")
iron(q1, q2, q3)
Behavior Analysis
#Begin player's analysis by creating a new dataframe with unique players' IDs in the 1st column.
df_play<-df_clean%>%
group_by(player_id) %>% #group by player ID
summarise(count=n(), # number of games per player
total_hrs=sum(played), #общее игровое время игрока
ave_hrs=mean(played)) %>% #среднее игровое время игрока
mutate(total_years = round(total_hrs/(24*365), digits=2))
df_play1<-df_play%>%arrange(desc(count))
knitr::kable(head(df_play1, 10), format="html") %>% kable_styling(bootstrap_options = "striped", full_width = F)
player_id | count | total_hrs | ave_hrs | total_years |
---|---|---|---|---|
62990992 | 1061 | 5724.8 | 5.395664 | 0.65 |
33865373 | 774 | 2274.4 | 2.938501 | 0.26 |
30246419 | 752 | 1219.7 | 1.621941 | 0.14 |
58345543 | 665 | 1374.9 | 2.067519 | 0.16 |
76892907 | 597 | 2731.3 | 4.575042 | 0.31 |
20772968 | 590 | 1395.0 | 2.364407 | 0.16 |
11403772 | 584 | 6009.9 | 10.290925 | 0.69 |
64787956 | 579 | 1222.3 | 2.111054 | 0.14 |
22301321 | 562 | 1460.8 | 2.599288 | 0.17 |
47457723 | 554 | 4844.3 | 8.744224 | 0.55 |
A player has on average 10 games purchased, plays on average 278 hrs in total, and 55 hrs per game. The data looks over dispersed. Negative binomial distribution could be one of the popular options for modelling count data such as games by player (it’s derived from the Poisson when the mean itself distributed with gamma distribution. This allows for heterogeneity in a population, and is thus for over dispersion). Again - KS test should give more hints. The distribution of hours per player look roughly log-normal.
des <-describe(df_play)
print(des, digits = 2)
## vars n mean sd median trimmed
## player_id 1 12392 1.61126e+08 80007568.18 164933610.5 161266735.53
## count 2 12392 1.03600e+01 35.48 2.0 3.60
## total_hrs 3 12392 2.78030e+02 760.30 13.5 92.91
## ave_hrs 4 12392 5.54900e+01 241.31 5.7 15.34
## total_years 5 12392 3.00000e-02 0.09 0.0 0.01
## mad min max range skew kurtosis
## player_id 92918954.22 5250 309903146.00 309897896.00 -0.05 -0.92
## count 1.48 1 1061.00 1060.00 10.74 175.55
## total_hrs 20.02 0 11754.00 11754.00 5.50 43.17
## ave_hrs 8.25 0 11754.00 11754.00 17.81 581.98
## total_years 0.00 0 1.34 1.34 5.48 43.03
## se
## player_id 718721.06
## count 0.32
## total_hrs 6.83
## ave_hrs 2.17
## total_years 0.00
How many “heavy gamers” there’re among players?
Define “heavy gamers” as follows:
- total hours > mean + st.dev. = 278+760.3=1038.3
- average hours > mean + st.dev. = 55.4+241 = 296.4
- number of games > mean + st.dev = 10.36+35.46 = 45.8
“Heavy players”
hg1<-nrow(df_play%>%filter(total_hrs>1040))
hg2<-nrow(df_play%>%filter(ave_hrs>300))
hg3<-nrow(df_play%>%filter(count>46))
cat("by total hours played:", hg1)
## by total hours played: 986
cat('by average hours played per game:', hg2)
## by average hours played per game: 483
cat('by number of games:', hg3)
## by number of games: 565
The most “heavy” gamer
Someone has an average playing time per game = total time = 11754.00. Find ID and which game he/she plays, and the game he plays:
bestpl<-df_play[df_play$total_hrs==max(df_play$total_hrs), "player_id"]
bestpl
## # A tibble: 1 x 1
## player_id
## <int>
## 1 73017395
bestgame<-df_clean[df_clean$player_id==73017395, "game"]
bestgame
## # A tibble: 1 x 1
## game
## <fct>
## 1 Sid Meier's Civilization V
How many players have 0 playing time?
zt<-nrow(df_play%>%filter(total_hrs==0))
cat("players with zero total hours played:", zt)
## players with zero total hours played: 1042
1042 players or 8.4% of unique id’s have 0 playing times.
Players distribution
#histogram of players distribution
pd12<-df_game%>%ggplot(aes(x=log10(count)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+labs(title="games by players distribution") +geom_density(alpha=.2,fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(count), na.rm=T)), color="red", linetype="dashed", size=1)
#histogram of tot_hours distribution
pd22<-df_play%>%filter(total_hrs!=0)%>%ggplot(aes(x=log10(total_hrs)))+
geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
labs(title="total hours by player distribution")+geom_density(alpha=.2, fill="#0040FF")+
geom_vline(aes(xintercept=mean(log10(total_hrs), na.rm=T)), color="red", linetype="dashed", size=1)
#histogram of ave_hours distribution
pd32<-df_play%>%filter(ave_hrs!=0)%>%ggplot(aes(x=log10(ave_hrs)))+geom_histogram(aes(y=..density..),bins=25, colour="black", fill="white")+
labs(title="average hours by player distribution")+geom_density(alpha=.2, fill="#0040FF")+geom_vline(aes(xintercept=mean(log10(ave_hrs), na.rm=T)), color="red", linetype="dashed", size=1)
iron(pd12, pd22, pd32)
Summary of the findings:
Part I. Games analysis:
1. There’s a small number of very popular games. A vast number of games in the dataset (30%) have 0 playing time. However, it’s not possible to distinguish between genuinely unpopular games, new games, and various game add-ons/updates.
2. The distribution of players by game looks power law (possibly Pareto). The data on total number of hours played by game is extremely over-dispersed.
3. Among the top 10 games, the most popular genre is shooter, mode - multi+single player. 4. The left tail data on playing time is over inflated with 0s due to the fact that There are add-ons present in the dataset, and one cannot distinguish between the add-ons and “unpopular” games. Should be careful when modeling such data, it’s better to complete the dataset with categorical variable on type of games.
Part II. Players’ behavior:
1. Very few users (8%) are engaged with games for high number of hours. Very few players (approx. 500 out of 12000 are “heavy gamers”). 1042 players or 8.4% of unique id’s have 0 playing times. Median users spend 5.7 hrs per game. Not clear if this can be attributed to user’s behavior or games features (duration of games, duration of the easiest levels).
2. Hours played look like log-normally distributed data. Negative binomial distribution - can be a start for modeling the number of games per players (taking into account some over dispersion in the data).