R analytics blog

Let’s start with a few observations about linear models. The usual discussion posits a target variable `y`

of
length N and a design matrix `X`

with p columns (of full rank, say) and N rows. The linear model equation can be written

and the least squares solution is obtained by projecting `y`

orthogonally onto the image of `X`

giving the almost iconic formula in the title for the estimated parameters.

The most common case – unless you are a geneticist –
is that `X`

is ‘long’, i.e. we have N >> p. We can test some properties of our least squares solution in this case in a little simulation

```
p = 1000
N = 10^5
beta = runif(p, -20, 20)
X <- matrix(rnorm(p*N), ncol = p)
y <- X %*% beta + rnorm(N, sd = 5)
system.time(XtX <- crossprod(X))[3]
```

```
## elapsed
## 62.784
```

`system.time(Xty <- drop(crossprod(X,y)))[3]`

```
## elapsed
## 0.308
```

`system.time(beta_est <- solve(XtX, Xty))[3]`

```
## elapsed
## 0.355
```

Classical treatments of the least squares solution focus on decomposing `X`

in a suitable manner to address both, the crossproduct and the inversion. This somewhat obscures the fact that there are two distinct data steps:

- a ‘small data’ step: Even with our hypothetical
`p = 1000`

coefficients, the inversion only takes a few seconds - a ‘big data’ step: The
*crossproduct*is what we need to worry about much more for typical large`X`

!

Our example case (large `X`

with N >> p) is ideal for chunking and parallelising. Chunking is made easy by packages like
`biglm`

and `speedglm`

that provide the infrastructure for handling chunked data and for updating estimates with new chunks.
Also note that the crossproducts
are ‘embarrassingly parallel’, i.e. they can be calculated via an appropriate ‘lapply’-ification:

```
library(parallel)
X_chunked <- lapply(split(X, (1:N)%%9) , matrix, ncol = p)
system.time(XtX_chunked <- mclapply(X_chunked, crossprod, mc.cores = 3))[3]
```

```
## elapsed
## 31.675
```

```
XtX2 <- 0; for(xtx in XtX_chunked) XtX2 <- XtX2 + xtx
sum(abs(XtX- XtX2)) ## ~ 0
```

So, chunking and parallelising helps.

To improve things further, one needs to look more closely at the structure of `X`

. One well-known approach is in the `MatrixModels`

-package, which allows to make use of sparsity in `X`

, for instance, when `X`

contains many dummy variables.

`X`

– star schemaThe situation for which I have not immediately found an adequate package, and which I want to discuss in a bit more detail here, is when `X`

is constructed from some kind of database ‘star schema’. As an example, we can take the fuel price data discussed here.

We start with loading the required packages and by defining a small benchmarking function, which will not only measure execution times, but also give a rough idea on memory usage:

```
## Clear all -- only if you have saved your other projects
rm(list=ls(all=TRUE))
library(tidyverse)
library(stringr)
library(splines)
library(Matrix)
library(MatrixModels)
## a simple benchmarker
mstest <- function(expr=NULL) {
mem <- gc(reset = TRUE)[2, c(2,6)]
st <- proc.time()
expr
mem2 <- gc(reset = FALSE)[2, c(2,6)] - mem
names(mem2) <- c("A_Mem final", "A_Mem max")
c(A_Time =structure(proc.time()- st, class = "proc_time")[1:3], mem2)
}
```

Now load (and clean) the fuel price data (find it on GitHub here). It is a very simple star schema, with fuel prices per station and time as fact table:

```
pricesAll <- readRDS( "Prices_Demo.rds")%>%
filter(E10 > 700, E10 < 3500, TimeAggID > 26425,
## choose TimeAggID filter to fit to your machine
!is.na(CompE10))
pricesAll
```

```
## StationID TimeAggID E10 CompE10
## <int> <int> <int> <int>
## 1 1 26426 1269 1450
## 2 1 26427 1269 1450
## 3 1 26428 1269 1450
## 4 1 26429 1269 1450
## 5 1 26430 1269 1439
## # ... with 4,859,833 more rows
```

as well as dimension tables for stations and times:

```
timesAll <- readRDS( "E_prtimes.rds")
timesAll
```

```
## TimeAggID DateHour DateDay HourCl DayCl Brent
## <int> <int> <int> <fctr> <fctr> <int>
## 1 1 14060800 140608 Hour00 Sun 501
## 2 2 14060801 140608 Hour01 Sun 501
## 3 3 14060802 140608 Hour02 Sun 501
## 4 4 14060803 140608 Hour03 Sun 501
## 5 5 14060804 140608 Hour04 Sun 501
## # ... with 27,523 more rows
```

```
stationsAll <- readRDS( "C_stationsAll_final.rds")
## add natural splines to stations
NS <- as_data_frame( model.matrix( ~ -1 + ns(lat, df = 3):ns(lng, df = 3),
data = stationsAll ))
names(NS) <- paste0("NS",str_replace_all(names(NS),
"\\:|ns\\(lat, df = \\d+\\)|ns\\(lng, df = \\d+\\)", ""))
stationsAll <- bind_cols(stationsAll, NS)
stationsAll
```

```
## StationID brandCl lng lat isBAB closeBAB NS11
## <int> <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 1 Others 10.850848 48.55568 0 0 0.39713860
## 2 2 BfT 10.920224 49.66159 0 0 0.43709986
## 3 3 Others 6.948950 51.16919 0 0 0.01966565
## 4 4 Others 11.451392 48.12189 0 1 0.37167161
## 5 5 Others 12.520820 48.24147 0 0 0.28931168
## # ... with 14,897 more rows, and 8 more variables: NS21 <dbl>, NS31 <dbl>,
## # NS12 <dbl>, NS22 <dbl>, NS32 <dbl>, NS13 <dbl>, NS23 <dbl>, NS33 <dbl>
```

`X`

Now, let’s say we want to estimate the influence of

- the brand of the station
- the closeness of the station to a highway (via isBAB, closeBAB)
- the location of the station in Germany (via the natural splines on long / lat)
- the time of day
- the average price of Competitors (CompE10 – I’ll say more about this, later)

etc. on the price of E10 fuel.

In the traditional approach, we join all the dimension information to the fact table and run the model on the resulting large table:

```
fulldat <- pricesAll %>%
left_join(select(timesAll, -DateHour, -DateDay )) %>%
left_join(stationsAll)
object.size(fulldat)/10^6 ## 622MB
RESULT <- NULL
modfor <- as.formula(
paste0("E10 ~ 1 + CompE10 + Brent + isBAB + closeBAB + brandCl +",
"DayCl + HourCl +",
paste(names(NS), collapse = " + ")))
msLM <- mstest(modLM <- lm( modfor, data = fulldat))
RESULT[["trad. LM"]] <- c(msLM, modLM$coefficients)
msLM
```

```
## A_Time.user.self A_Time.sys.self A_Time.elapsed A_Mem final
## 41.343 0.854 42.315 2870.500
## A_Mem max
## 5058.200
```

So, how to make this smaller and faster?

The main observation is that the parts of the design matrix, that come from the joined dimension matrices, have a very simple form. For example, when looking at the variables which depend on time only:

` matTime = model.Matrix(~ -1 + Brent + DayCl + HourCl, data = timesAll )`

is the design matrix on the table `timesAll`

and

`indTime = as(list(pricesAll$TimeAggID, max(timesAll$TimeAggID)), "indMatrix")`

is the (sparse version of) the ‘index matrix’, representing the join. This matrix has a 1 in position `(ro, co)`

exactly when `TimeAggID[ro] == co`

and zeros otherwise. It is easy to check that the matrix product of `indTime`

and `matTime`

is the lift / join of `matTime`

to the fact table

`all(indTime %*% matTime == matTime[pricesAll$TimeAggID, ]) ## TRUE`

This is a very useful representation, as

- matTime will be comparatively small,
- crossproducts with the index matrix 'indTime' amount to a simple summing over groups, and need to be executed only once, regardless the size of matTime

Furthermore, index matrix algebra is already implemented in the `Matrix`

-package.

Let’s call these objects ‘lifted matrices’. We can actually define them as a simple extension of the classes in `Matrix`

:

```
setClass("liftMatrix",
representation(mat = "Matrix",
indI = "indMatrix"))
```

The beauty of the ‘Matrix’-package (and the S4-system it is built on) is that we now only need to implement the algebra for this new class as a set of S4-methods,
adding to (in one case for the moment: overriding) the existing method-definitions in `Matrix`

:

```
## changes some 'Matrix'-definitions. Restart R afterwards
setMethod("crossprod", signature( x = "indMatrix", y = "missing"),
function(x) Diagonal(x = tabulate(x@perm, x@Dim[2])))
setMethod("crossprod", signature( x = "indMatrix", y = "indMatrix" ),
function(x, y) {
l1 <- as(x, "lMatrix")
l2 <- as(y, "lMatrix")
crossprod(l1,l2)
})
setMethod("crossprod", signature( x = "liftMatrix", y = "missing"),
function(x) crossprod(x@mat, crossprod(crossprod(x@indI) , x@mat)))
setMethod("crossprod", signature( x = "liftMatrix", y = "liftMatrix" ),
function(x, y) crossprod(x@mat,
crossprod(crossprod(y@indI, x@indI), y@mat)))
setMethod("crossprod", signature( x = "liftMatrix", y = "Matrix" ),
function(x, y) {
crossprod(x@mat, crossprod(x@indI, y)) })
setMethod("crossprod", signature( x = "Matrix", y = "liftMatrix" ),
function(x, y) {
crossprod(x, y@indI) %*% y@mat })
```

With these definitions, it is now easy to define the ‘lifted’ (or ‘pushed down’, depending on your point of view) version of our fuel price model above, by separately defining the components of the design which stem from the two dimension tables and (for the effect of `CompE10`

) from the fact table itself.

I go through the full calculation here for demonstration purposes. I hope to put this into a more user-friendly format, soon.

```
msLift <- mstest({
modforS <- as.formula(
paste0("~ 1 + isBAB + closeBAB + brandCl +",
paste(names(NS), collapse = " + ")))
XS <- new("liftMatrix",
mat = model.Matrix(modforS, data = stationsAll )*1.0,
indI = as(list(pricesAll$StationID, max(stationsAll$StationID)),
"indMatrix"))
modforT <- as.formula(paste0("~ 1 + Brent + DayCl + HourCl"))
XT <- new("liftMatrix",
mat = model.Matrix(modforT, data = timesAll )[,-1]*1.0,
indI = as(list(pricesAll$TimeAggID, max(timesAll$TimeAggID)),
"indMatrix"))
XC <- Matrix( pricesAll$CompE10, ncol = 1,
dimnames = list(NULL, "CompE10"))
Y <- Matrix(pricesAll$E10, ncol = 1)
XTXT <- crossprod(XT)
XSXS <- crossprod(XS)
XCXC <- crossprod(XC)
XTXS <- crossprod(XT, XS)
XTXC <- crossprod(XT, XC)
XSXC <- crossprod(XS, XC)
XSY <- crossprod(XS, Y)
XTY <- crossprod(XT, Y)
XCY <- crossprod(XC, Y)
XX <- rbind(cbind(XTXT, XTXS, XTXC),
cbind(t(XTXS), XSXS, XSXC),
cbind(t(XTXC), t(XSXC), XCXC))
XY <- rbind(XTY, XSY, XCY)
modLift <- drop(solve(XX, XY))
})
RESULT[["lifted LM"]] <- c(modLift, msLift)
```

A bit of arranging of the results for easier comparison:

```
for(nam in names(RESULT)) RESULT[[nam]]<-
data.frame( estimate = RESULT[[nam]]) %>%
rownames_to_column(var = "term") %>%
mutate(model = nam)
resultWide <- bind_rows(RESULT) %>% dplyr::select(model, term, estimate) %>%
spread(model, estimate)
head(resultWide, 10)
```

```
## term lifted LM trad. LM
## 1 A_Mem final 84.000000 2870.500000
## 2 A_Mem max 2103.000000 5058.200000
## 3 A_Time.elapsed 3.510000 42.315000
## 4 A_Time.sys.self 0.204000 0.854000
## 5 A_Time.user.self 3.291000 41.343000
## 6 brandClAral 23.329966 23.329966
## 7 brandClAvia -5.360235 -5.360235
## 8 brandClBfT -17.300784 -17.300784
## 9 brandClEsso -3.675214 -3.675214
## 10 brandClHEM -20.194030 -20.194030
```

Four seconds instead of forty – Pretty impressive, huh?

Well, of course, I have massaged the example a bit. The model we have been looking at is a pretty extreme case. Almost all the explanatory variables are ‘lifts’ from one of the dimension tables. So the model is almost a product of two completely separate models. I actually had to sneak in the variable ‘CompE10’ (with all the problems, endogeneity, inconsistency, this variable brings, were we to really try to interpret this model) in order to make the example a truly non-product one.

Still, this approach is worthwile in many cases, and I will soon publish a more elaborate example, including how to deal with chunked and partitioned data, as well as allowing fixed effects.

So stay tuned!

Boris Vaillant - Quantitative Consulting 17

QC 17