Add graph base matching #25

sgibb · 2019-10-25T10:06:04Z

This PR introduces joinGraph an alternative implementation for the idea of graph based matching as described by @tnaake in rformassspectrometry/Spectra#49 and proposed in PR #23

I am sorry but I don't want to accept @tnaake's PR #23 because of several problems.

Even after reading it multiple times I don't understand the code completely. So I don't want to maintain it.
(Ok, you could argue that you don't understand my code either.)
It doesn't work on my laptop except for toy examples because of the huge amount of memory it requires.
It changes the order of the mz values in spectra.

There are a few notable difference between my and @tnaake's approach:

I construct an edge list instead of an adjacency matrix which results in much less memory consumption.
I don't use the igraph package because the edge list contains already all information we need for an undirected graph.
For a smaller memory footprint I just pre-calculate the combinations of multiple matches (and reuse the distinct matches from the initial edge list).

@tnaake: while I won't accept your PR I greatly appreciate your contributions and would be very happy if you would be so kind to test this function with your data and review the code (I will add you to the reviewer as soon as you accept the collaborator invitation).

Toy example:

library("MsCoreUtils")
library("igraph")

source("playground/graphPeaks.R")

## toy example
x <- matrix(
     c(100.001, 100.002, 300.01, 300.02, 1, 9, 1, 9),
       ncol = 2L, dimnames = list(c(), c("mz", "intensity"))
)
y <- matrix(
    c(100.0, 200.0, 300.002, 300.025, 300.0255, 9, 1, 1, 9, 1),
    ncol = 2L, dimnames = list(c(), c("mz", "intensity"))
)

## the original dotproduct isn't working because `validPeaksMatrix` fails in
## `graphPeaks` (because the order of mz values changed)
.dotproduct <- function(x, y) {
    mz1 <- x[, "mz"]
    mz2 <- y[, "mz"]
    inten1 <- x[, "intensity"]
    inten2 <- y[, "intensity"]
    ws1 <- sqrt(inten1 / max(inten1, na.rm = TRUE) * 100)
    ws2 <- sqrt(inten2 / max(inten2, na.rm = TRUE) * 100)
    ## calculate normalized dot product
    sum(ws1 * ws2, na.rm = TRUE) ^ 2 /
        (sum(ws1 ^ 2, na.rm = TRUE) * sum(ws2 ^ 2, na.rm = TRUE))
}

.joinGraph <- function(x, y, ...) {
    jg <- joinGraph(x, y, ...)
    list(x = x[jg$x, ], y = y[jg$y, ])
}

all.equal(
    graphPeaks(x, y, ppm = 20, FUN = .dotproduct),
    .joinGraph(x, y, ppm = 20, FUN = .dotproduct)
)
# [1] "Component “x”: Attributes: < Component “dim”: Mean relative difference: 0.1666667 >"
# [2] "Component “x”: Numeric: lengths (12, 14) differ"
# [3] "Component “y”: Attributes: < Component “dim”: Mean relative difference: 0.1666667 >"
# [4] "Component “y”: Numeric: lengths (12, 14) differ"

## the order of the first mz values changed
## and there is also a > 20 ppm match in row 4
## ppm(300.01, 20) == 0.006; 300.01 + 0.006 = 300.016 < 300.02
graphPeaks(x, y, ppm = 20, FUN = .dotproduct)
# $x
#           mz intensity
# [1,] 100.002         9
# [2,] 100.001         1
# [3,]      NA         0
# [4,] 300.010         1
# [5,] 300.020         9
# [6,]      NA         0
#
# $y
#            mz intensity
# [1,] 100.0000         9
# [2,]       NA         0
# [3,] 200.0000         1
# [4,] 300.0020         1
# [5,] 300.0250         9
# [6,] 300.0255         1

joinGraph(x, y, ppm = 20, FUN = .dotproduct)
# $x
# [1]  1  2 NA NA  3  4 NA
#
# $y
# [1] NA  1  2  3 NA  4  5

.joinGraph(x, y, ppm = 20, FUN = .dotproduct)
# $x
#           mz intensity
# [1,] 100.001         1
# [2,] 100.002         9
# [3,]      NA        NA
# [4,]      NA        NA
# [5,] 300.010         1
# [6,] 300.020         9
# [7,]      NA        NA
#
# $y
#            mz intensity
# [1,]       NA        NA
# [2,] 100.0000         9
# [3,] 200.0000         1
# [4,] 300.0020         1
# [5,]       NA        NA
# [6,] 300.0250         9
# [7,] 300.0255         1

Because of less memory allocation and vectorisation it is faster.

library("microbenchmark")
microbenchmark(
    graphPeaks(x, y, ppm = 20, FUN = .dotproduct),
    .joinGraph(x, y, ppm = 20, FUN = .dotproduct)
)
# Unit: microseconds
#                                           expr      min       lq      mean
#  graphPeaks(x, y, ppm = 20, FUN = .dotproduct) 5860.683 6124.560 6459.0414
#  .joinGraph(x, y, ppm = 20, FUN = .dotproduct)  591.393  630.318  747.0722
#    median       uq       max neval
#  6232.061 6415.744 10155.622   100
#   693.135  733.113  3302.426   100

It also works with real (moderate small) spectra (depending on the number of possible combinations):

## prepare some real example data
library("mzR")
f <- msdata::proteomics("01.mzML.gz", full.names=TRUE)

ms <- openMSfile(f)
on.exit(close(ms))

p <- lapply(peaks(ms)[c(1, 41, 2, 3)], function(x) {
    colnames(x) <- c("mz", "intensity")
    x
})

lengths(p)
# [1] 87054 89584   422   248

system.time(graphPeaks(p[[1]], p[[2]], ppm = 20))
# Error: cannot allocate vector of size 58.1 Gb

system.time(joinGraph(p[[1]], p[[2]], ppm = 20))
# Error: too many combinations

system.time(graphPeaks(p[[3]], p[[4]], ppm = 20))
# Error: cannot allocate vector of size 37.3 Gb

system.time(joinGraph(p[[3]], p[[4]], ppm = 20))
#    user  system elapsed
#   0.001   0.000   0.001

But it is much much slower than the other simple join commands (as expected):

microbenchmark( 
    join(p[[3]][,1], p[[4]][,1], tol=0.2, ppm=0, type = "outer"), 
    joinGraph(p[[3]], p[[4]], tol=0.2, ppm=0), 
    times = 10
)
# Unit: microseconds
#                                                                expr         min
#  join(p[[3]][, 1], p[[4]][, 1], tol = 0.2, ppm = 0, type = "outer")     156.724
#                       joinGraph(p[[3]], p[[4]], tol = 0.2, ppm = 0) 1975215.020
#           lq         mean       median          uq         max neval
#      177.958     258.2854     266.6545     330.403     387.131    10
#  2105836.670 2256661.7154 2203902.8875 2503019.513 2548014.391    10

Things to discuss:

Add a type = "graph" to join and remove joinGraph
(this would require that the input arguments x and y are always matrices instead of numerics for join)
Often it would be impossible/impractical to test all possible combinations. Maybe it would be useful to divide the spectra in parts to reduce the number of combinations.
Provide more similarity methods (dotproduct, mindiff, ...)
Add a dotproduct2 or .dotproduct that doesn't run all these checks to save run time.

jorainer · 2019-10-29T13:58:44Z

Thanks for the PR @sgibb . Just some quick response to the discussions:

Add a type = "graph" to join and remove joinGraph
(this would require that the input arguments x and y are always matrices instead of numerics for join)

I would keep the functions separate, join should work with numeric to not restrict it to the use with peak matrices. I think there might be other application scenarios too.

tnaake · 2019-11-07T11:31:51Z

Dear @sgibb

thank you for inviting for the review (and sorry for late reply). Indeed I also think that graphPeaks is memory-consuming since we are dealing with huge adjacency matrices. Great idea to use an edge list instead.

It is the first time to review code, so I do not know how to use the GitHub-built in tools (I will not use them for the start).

I have two points for the beginning:

I am not sure about the following

## and there is also a > 20 ppm match in row 4
## ppm(300.01, 20) == 0.006; 300.01 + 0.006 = 300.016 < 300.02
graphPeaks(x, y, ppm = 20, FUN = .dotproduct)
# $x
#           mz intensity
# [1,] 100.002         9
# [2,] 100.001         1
# [3,]      NA         0
# [4,] 300.010         1
# [5,] 300.020         9
# [6,]      NA         0
#
# $y
#            mz intensity
# [1,] 100.0000         9
# [2,]       NA         0
# [3,] 200.0000         1
# [4,] 300.0020         1
# [5,] 300.0250         9
# [6,] 300.0255         1

In my opinion ppm should be applied to both m/z values

ppm(300.01, 20) ## 0.0060002
ppm(300.02, 20) ## 0.0060004

Thus the "true" value is 300.01+-0.006 and 300.02+-0.006 with an overlap between 300.014-300.016. Thus, there is a joint peak reported.

If I run the following using .joinPeaks:

x <- matrix(c(c(100.001, 100.0015, 100.002, 300.01, 300.02),
    c(1, 3, 3, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(x) <- c("mz", "intensity")
 
y <- matrix(c(c(100.0, 200.0, 300.002, 300.025, 300.0255),
     c(1, 1, 1, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(y) <- c("mz", "intensity")

.joinGraph(x, y, ppm = 20)

I get the following error:

Error in .transposeList(l) : 
  'e' has to be a list with elements of equal length.

On the contrary

x <- matrix(c(c(100.001, 100.002, 300.01, 300.02),
    c(1, 3, 1, 1)), ncol = 2, nrow = 4, byrow = FALSE)
colnames(x) <- c("mz", "intensity")
 
y <- matrix(c(c(100.0, 200.0, 300.002, 300.025, 300.0255),
     c(1, 1, 1, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(y) <- c("mz", "intensity")

.joinGraph(x, y, ppm = 20)

works fine

Any ideas on that?

@tnaake

pointed out by @tnaake

sgibb · 2019-11-18T10:25:56Z

@tnaake thanks for looking into this

In my opinion ppm should be applied to both m/z values

I understand your point. I think this isn't a real problem. We have to document how ppm is defined.
My interpretation is 300.02 - 300.01 > ppm(300.01, 20) and yours is 300.02 - 300.01 < sum(ppm(c(300.01, 300.02), 20)).

@jorainer and @lgatto what is your opinion here?

BTW: I use closest internally which don't allow a tolerance/ppm error for x and table but the behaviour that @tnaake suggests should be more or less be possible with clostest(x, table, ppm = 2 * ppm) (of course not exactly because sum(ppm(c(300.01, 300.02), 20)) > 2 * ppm(300.01, 20) but we could use closest(x, table, ppm = 2.5 * ppm) and test all matches against diff(singleMatchMzs) < sum(ppm(singleMatchMzs, 20)) subsequently.

That are indeed two bugs. The first was in .combinations that I fixed with the latest commit. The second one is in .edgeGroups that isn't working for this specific case. I added a unit test with your example but have no time to fix it yet (the .edgeGroups function is the most important and tricky one, so if anyone has a good solution I am happy to hear it).

sgibb · 2019-11-18T10:32:49Z

@jorainer : regarding the common interface. I would like to use two matrices as input and provide a common interface for all joining functions. You use matrices in Spectra::joinPeaks anyway:

https://github.com/rformassspectrometry/Spectra/blob/1ff21ac05fbe531433c116b563ddba2cf1517382/R/peaks-functions.R#L213-L216

And I guess in Chromatograms we would have matrices as input, too.

sgibb · 2019-11-18T13:35:39Z

@tnaake now I fixed the .edgeGroup function (but it is not vectorized anymore, any input to vectorize it again is welcome!)

Now I got:

x <- matrix(c(c(100.001, 100.0015, 100.002, 300.01, 300.02),
    c(1, 3, 3, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(x) <- c("mz", "intensity")

y <- matrix(c(c(100.0, 200.0, 300.002, 300.025, 300.0255),
     c(1, 1, 1, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(y) <- c("mz", "intensity")


joinGraph(x, y, ppm = 20)
# $x
# [1]  1  2  3 NA NA  4  5 NA
#
# $y
# [1] NA  1 NA  2  3 NA  4  5
#
mindiff <- function(x, y)-sum(abs(x[, 2L] - y[, 2L]), na.rm = TRUE)
joinGraph(x, y, ppm = 20, FUN = mindiff)
# $x
# [1]  1  2  3 NA NA  4  5 NA
#
# $y
# [1]  1 NA NA  2  3 NA  4  5
#

.joinGraph(x, y, ppm = 20)
# $x
#            mz intensity
# [1,] 100.0010         1
# [2,] 100.0015         3
# [3,] 100.0020         3
# [4,]       NA        NA
# [5,]       NA        NA
# [6,] 300.0100         1
# [7,] 300.0200         1
# [8,]       NA        NA
# 
# $y
#            mz intensity
# [1,]       NA        NA
# [2,] 100.0000         1
# [3,]       NA        NA
# [4,] 200.0000         1
# [5,] 300.0020         1
# [6,]       NA        NA
# [7,] 300.0250         1
# [8,] 300.0255         1
# 
.joinGraph(x, y, ppm = 20, FUN = mindiff)
# $x
#            mz intensity
# [1,] 100.0010         1
# [2,] 100.0015         3
# [3,] 100.0020         3
# [4,]       NA        NA
# [5,]       NA        NA
# [6,] 300.0100         1
# [7,] 300.0200         1
# [8,]       NA        NA
# 
# $y
#            mz intensity
# [1,] 100.0000         1
# [2,]       NA        NA
# [3,]       NA        NA
# [4,] 200.0000         1
# [5,] 300.0020         1
# [6,]       NA        NA
# [7,] 300.0250         1
# [8,] 300.0255         1
#

codecov-io · 2019-11-18T13:37:30Z

Codecov Report

Merging #25 into master will decrease coverage by 0.25%.
The diff coverage is 98.82%.

@@            Coverage Diff             @@
##           master      #25      +/-   ##
==========================================
- Coverage     100%   99.74%   -0.26%     
==========================================
  Files          20       22       +2     
  Lines         306      392      +86     
==========================================
+ Hits          306      391      +85     
- Misses          0        1       +1

Impacted Files	Coverage Δ
R/utils.R	`100% <100%> (ø)`	⬆️
R/matching.R	`100% <100%> (ø)`	⬆️
R/joinGraph.R	`98.14% <98.14%> (ø)`
R/group.R	`100% <0%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc0363b...0a72dad. Read the comment docs.

jorainer · 2019-12-06T08:54:34Z

@sgibb , sorry for the late reply. Regarding the use of matrices, if we have to use matrices then go ahead and use them. But you're right, then we have to use matrices also for all other joining functions. I just wanted to keep them as generic as possible to allow also their use if you have only numeric vectors. Maybe let's discuss in detail at the EuroBioc2019.

tnaake · 2019-12-30T12:04:09Z

Hi @sgibb,

many thanks for looking into this.

Here are some comments:

in joinPeaks the function validPeaksMatrix is applied every time for each object x and y

validPeaksMatrix(x)
validPeaksMatrix(y)

It is only a small check, but I am wondering if there can be a upstream function for checking the validity of spectra matrices, i.e. checking them on a list of matrices instead of matrices separately. Running validPeaksMatrix on 1000 spectra causes n**2 checks, while running this on the list only requires n checks.

x <- matrix(c(c(100.001, 100.002, 300.01, 300.02),
    c(1, 3, 1, 1)), ncol = 2, nrow = 4, byrow = FALSE)
colnames(x) <- c("mz", "intensity")
microbenchmark(validPeaksMatrix(x))
Unit: microseconds
               expr   min    lq    mean median    uq    max neval
 validPeaksMatrix(x) 7.242 7.534 8.13696  7.621 7.759 40.277   10

For 1000 spectra, this will result in 8 s or 0.008 s checks, respectively. It is not a lot of time I would say, but probably can be improved.

Given the arguments:
x <- matrix(c(c(100.0011, 100.0011),
c(1, 1)), ncol = 2, nrow = 2, byrow = FALSE)
colnames(x) <- c("mz", "intensity")

I receive the following output:

joinGraph(x, x)
$x
[1] 1 2 1
$y
[1] 1 NA 2

and for .joinGraph

.joinGraph(x, x)
$x
mz intensity
[1,] 100.0011 1
[2,] 100.0011 1
[3,] 100.0011 1
$y
mz intensity
[1,] 100.0011 1
[2,] NA NA
[3,] 100.0011 1

Why one entry is duplicated, i.e. three peaks are returned instead of two for $x? Probably duplicated peaks will not happen when working with real spectra, but it appears to be a bug.

Concerning the ppm it should be changed that ppm is applied to all peaks, instead of only to one (see above). Since the MS cannot determine the exact m/z on only one, but always on two peaks. @jorainer and @lgatto any comments on this?
@sgibb could you add some inline documentation for the function closest in .edgeList? It is a bit hard for me to understand what is going on in this function. Especially what is going on with the function findInterval
In .combinations add space between / and characters before and after for

ncmb <- ncmb/n

jorainer · 2020-01-07T07:16:48Z

Concerning the ppm it should be changed that ppm is applied to all peaks, instead of only to one (see above). Since the MS cannot determine the exact m/z on only one, but always on two peaks.

@tnaake, could you elaborate a little on this? What exactly do you mean here? Changing the ppm function? As far as I know the ppm is a relative error on a single m/z value (or at least that is what I have been told).

tnaake · 2020-01-08T19:33:38Z

Dear @jorainer
sure. There is no need to change the ppm function. ppm is a random relative error on m/z values that is given by the accuracy of the mass spectrometer. Using the joinGraphfunction above, we would like to align the spectra of two spectra. In order to do this we need to give to the function a ppm value that tells joinGraph how accurately the spectra were obtained . Thus, for two peaks from two spectra given a certain difference the two peaks will be aligned or not. For example, should the peak with m/z 100.010 of spectrum 1 and the peak with m/z value 100.011 of spectrum 2 be regarded as originating from the same theoretical m/z value.

The question now is to which peak(s) do we apply the ppm function? Only on peaks of one of the spectra (if I am right this is how it is currently implemented) or on peaks of all spectra?

In my point of view, the ppm function should be applied to all peaks in all spectra, since the error operated on all obtained peaks. This will result in a lower and upper value of possible 'true' m/z values per peak in all spectra that can be aligned by joinGraph. If we use ppm only on peaks of one spectra this would mean that for the other peaks of the second spectra, we are sure that these are the true/accurate values, which is not true.

I hope it is a bit clearer what I meant by this comment now.

jorainer · 2020-01-13T13:32:28Z

Thanks @tnaake for the explanation. I completely agree with you. The ppm should be applied to all (both) spectra in the alignment - and I thought we have implemented it that way. I will cross check.

jorainer · 2020-01-13T13:54:35Z

I had a look at the joinPeaks function, and also there we are applying the ppm only once. I see here two solutions to the problem:

we only apply ppm once (i.e. to one spectrum) as it is now but clearly describe this in the documentation.
we use 2 * ppm in the comparison (e.g. in line https://github.com/rformassspectrometry/MsCoreUtils/blob/master/R/matching.R#L105 of joinPeaks). This is not exactly the same as applying ppm to the m/z values of spectrum 1 and to the m/z values of spectrum 2, but should be close enough (which I would accept because applying ppm to each spectrum separately will have a bigger impact on performance). Again, we have to clearly state that in the documentation.

Regarding the tolerance parameter, I would apply that one only once (as it is supposed to be an absolute, constant difference, while the ppm is supposed to be a m/z relative measurement error of the machine).

@sgibb , @lgatto and @tnaake what are your thoughts about that?

tnaake · 2020-01-20T18:23:28Z

I am voting for the 2nd solution. The error should be minimal. I agree that this should be properly documented.

jorainer

All OK from my side. The problem with travis is strange, especially because in the jomaster branch I don't have it (yet?).

sgibb added 26 commits October 24, 2019 23:24

feat: add .orderEdges

e78cf31

refactor: use .orderEdges for .joinOuter

9c05146

feat: add .edgeList

020ccce

docs: fix join return value

0a588ac

docs: don't create man pages for internal functions

6bd7553

perf: create list instead of matrix

ba7e450

feat: add .anyCrossing

8e4a788

feat: add .edgeGroups

40534a4

feat: add .combinations

4215ab1

feat: add .transposeList

4205cef

tests: add joinGraph unit test

066be78

fix: remove useless na.rm argument

341b6ad

fix: correctly handle mutliple NA

f0d288e

refactor: use vectors as input for .anyCrossing

517e865

feat: add is{Precursor,Follower}Identical

0bd04da

feat: add .edgeGroupFrom

b40dcd3

tests: add additional unit test for joinGraph

ed28fc7

feat: throw an error for too many combinations

01cd683

tests: fix joinGraph unit tests

99f479d

feat: add joinGraph

94c1b3b

docs: add author information

d5ff969

chore: move some joinGraph helper functions to utils.R

b516087

tests: fix .combinations error message

8c27b15

docs: run roxygenise

94421ff

fix: import stats::setNames

3e557b4

docs: add function name to authors field

734d76d

sgibb added the enhancement New feature or request label Oct 25, 2019

sgibb requested a review from jorainer October 25, 2019 10:06

sgibb self-assigned this Oct 25, 2019

sgibb mentioned this pull request Oct 25, 2019

graphPeaks (recreate PR after merging) #23

Closed

sgibb requested a review from tnaake October 26, 2019 05:38

sgibb added 2 commits November 18, 2019 10:20

fix: .combinations generated list with wrong length

e0a2c06

pointed out by @tnaake

tests: add unit test for .edgeGroups bug

5a50c3b

fix: .edgeGroup, not vectorized anymore

0a72dad

sgibb added 4 commits March 5, 2020 13:11

Merge branch 'master' into joinGraph

d822445

fix: treat equal indicies as crossing

e201cb1

fix: throw a warning if no valid match was found

8495a44

refactor: remove validPeaksMatrix call

096ff09

jorainer approved these changes Mar 9, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add graph base matching #25

Add graph base matching #25

sgibb commented Oct 25, 2019

jorainer commented Oct 29, 2019

tnaake commented Nov 7, 2019

sgibb commented Nov 18, 2019

sgibb commented Nov 18, 2019

sgibb commented Nov 18, 2019

codecov-io commented Nov 18, 2019 •

edited

Loading

jorainer commented Dec 6, 2019

tnaake commented Dec 30, 2019

jorainer commented Jan 7, 2020

tnaake commented Jan 8, 2020

jorainer commented Jan 13, 2020

jorainer commented Jan 13, 2020

tnaake commented Jan 20, 2020

jorainer left a comment

Add graph base matching #25

Are you sure you want to change the base?

Add graph base matching #25

Conversation

sgibb commented Oct 25, 2019

jorainer commented Oct 29, 2019

tnaake commented Nov 7, 2019

sgibb commented Nov 18, 2019

sgibb commented Nov 18, 2019

sgibb commented Nov 18, 2019

codecov-io commented Nov 18, 2019 • edited Loading

Codecov Report

jorainer commented Dec 6, 2019

tnaake commented Dec 30, 2019

jorainer commented Jan 7, 2020

tnaake commented Jan 8, 2020

jorainer commented Jan 13, 2020

jorainer commented Jan 13, 2020

tnaake commented Jan 20, 2020

jorainer left a comment

Choose a reason for hiding this comment

codecov-io commented Nov 18, 2019 •

edited

Loading