More and more I'm using the 'apply' functions in R (apply, mapply, sapply, lappy,...). The help functions on these are hard to decipher. Not till I read a post by Neil Saunders, did I really start using them instead of loops.

Lately, I've been creating more nested-apply functions, i.e., an apply function within an apply function. The latest nested apply function I created does something really neat with specific data. What it does is create a list of indices in a sparse matrix down columns of blocks of numbers. Let's look at a simple example.

Let's say here is our raw data:

What we are looking to create in R is a list of the block indices of non-zero numbers by column. Here is the function with it's output:

> data = matrix(data=c(2,2,3,0,0,0,0,1,1,0,0,4,3,0,0,2,3,4,0,3,3),

+ nrow=7,ncol=3,dimnames=list(1:7,c("A","B","C")))

> data

A B C

1 2 1 0

2 2 1 2

3 3 0 3

4 0 0 4

5 0 4 0

6 0 3 3

7 0 0 3

The output we seek is a list containing the indices for each non-zero block. For example we want the output for column A to look like:

A[[1]] = 1 2 3 (because the first three positions in column A are non-zero and compromise one block of numbers.

Column B has two blocks, so the output should be:

B[[1]] = 1 2

B[[2]] = 5 6

Here's the nested apply function to create this list (lapply nested inside an apply on the columns).

> list.data = apply(data,2,function(x) lapply(unique(cumsum(c(FALSE, diff(which(x>0))!=1))+1),function(y){

+ which(x>0)[(cumsum(c(FALSE, diff(which(x>0))!=1))+1)==y]

+ }))

> list.data

$A

$A[[1]]

1 2 3

1 2 3

$B

$B[[1]]

1 2

1 2

$B[[2]]

5 6

5 6

$C

$C[[1]]

2 3 4

2 3 4

$C[[2]]

6 7

6 7

Let's break it down step by step. The first thing is to find the non-zero indices. (Let's perform this on the second column, B)

> which(data[,2]>0)

1 2 5 6

1 2 5 6

Remember R reads down columns. Now let's find where non-consecutive differences occur (to separate the blocks).

> diff(which(data[,2]>0))

2 5 6

1 3 1

Notice, wherever there are non-ones is where the break occurs. But because of how the 'diff' function works, we need to add a position in front. Let's also convert them to logicals.

> c(FALSE, diff(which(data[,2]>0))!=1)

2 5 6

FALSE FALSE TRUE FALSE

If we do a cumulative sum of these logicals (remember FALSE = 0, and TRUE = 1) we will have separated the blocks with integers starting at 1.

> cumsum(c(FALSE, diff(which(data[,2]>0))!=1))

2 5 6

0 0 1 1

If we add one and find the unique ones, we get:

> unique(cumsum(c(FALSE, diff(which(data[,2]>0))!=1))+1)

[1] 1 2

Now let's wrap it up in a 'lapply' function to return a list for column B:

> lapply(unique(cumsum(c(FALSE, diff(which(data[,2]>0))!=1))+1),function(y){

+ which(data[,2]>0)[(cumsum(c(FALSE, diff(which(data[,2]>0))!=1))+1)==y]

+ })

[[1]]

1 2

1 2

```
```

`[[2]]`

5 6

5 6

Now we just stick it in an apply function across the columns to find the indices of blocks of non-zero numbers. (Or across rows if you want.)

> apply(data,2,function(x) lapply(unique(cumsum(c(FALSE, diff(which(x>0))!=1))+1),function(y){

+ which(x>0)[(cumsum(c(FALSE, diff(which(x>0))!=1))+1)==y]

+ }))

$A

$A[[1]]

1 2 3

1 2 3

$B

$B[[1]]

1 2

1 2

$B[[2]]

5 6

5 6

$C

$C[[1]]

2 3 4

2 3 4

$C[[2]]

6 7

6 7

Sometimes I get so proud of my nested apply functions... sniff... sniff. One day I hope to write a triple nested apply function. I hope someone finds this useful.