Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add graph base matching #25

Open
wants to merge 33 commits into
base: main
Choose a base branch
from
Open

Add graph base matching #25

wants to merge 33 commits into from

Conversation

sgibb
Copy link
Member

@sgibb sgibb commented Oct 25, 2019

This PR introduces joinGraph an alternative implementation for the idea of graph based matching as described by @tnaake in rformassspectrometry/Spectra#49 and proposed in PR #23

I am sorry but I don't want to accept @tnaake's PR #23 because of several problems.

  1. Even after reading it multiple times I don't understand the code completely. So I don't want to maintain it.
    (Ok, you could argue that you don't understand my code either.)
  2. It doesn't work on my laptop except for toy examples because of the huge amount of memory it requires.
  3. It changes the order of the mz values in spectra.

There are a few notable difference between my and @tnaake's approach:

  1. I construct an edge list instead of an adjacency matrix which results in much less memory consumption.
  2. I don't use the igraph package because the edge list contains already all information we need for an undirected graph.
  3. For a smaller memory footprint I just pre-calculate the combinations of multiple matches (and reuse the distinct matches from the initial edge list).

@tnaake: while I won't accept your PR I greatly appreciate your contributions and would be very happy if you would be so kind to test this function with your data and review the code (I will add you to the reviewer as soon as you accept the collaborator invitation).


Toy example:

library("MsCoreUtils")
library("igraph")

source("playground/graphPeaks.R")

## toy example
x <- matrix(
     c(100.001, 100.002, 300.01, 300.02, 1, 9, 1, 9),
       ncol = 2L, dimnames = list(c(), c("mz", "intensity"))
)
y <- matrix(
    c(100.0, 200.0, 300.002, 300.025, 300.0255, 9, 1, 1, 9, 1),
    ncol = 2L, dimnames = list(c(), c("mz", "intensity"))
)

## the original dotproduct isn't working because `validPeaksMatrix` fails in
## `graphPeaks` (because the order of mz values changed)
.dotproduct <- function(x, y) {
    mz1 <- x[, "mz"]
    mz2 <- y[, "mz"]
    inten1 <- x[, "intensity"]
    inten2 <- y[, "intensity"]
    ws1 <- sqrt(inten1 / max(inten1, na.rm = TRUE) * 100)
    ws2 <- sqrt(inten2 / max(inten2, na.rm = TRUE) * 100)
    ## calculate normalized dot product
    sum(ws1 * ws2, na.rm = TRUE) ^ 2 /
        (sum(ws1 ^ 2, na.rm = TRUE) * sum(ws2 ^ 2, na.rm = TRUE))
}

.joinGraph <- function(x, y, ...) {
    jg <- joinGraph(x, y, ...)
    list(x = x[jg$x, ], y = y[jg$y, ])
}

all.equal(
    graphPeaks(x, y, ppm = 20, FUN = .dotproduct),
    .joinGraph(x, y, ppm = 20, FUN = .dotproduct)
)
# [1] "Component “x”: Attributes: < Component “dim”: Mean relative difference: 0.1666667 >"
# [2] "Component “x”: Numeric: lengths (12, 14) differ"
# [3] "Component “y”: Attributes: < Component “dim”: Mean relative difference: 0.1666667 >"
# [4] "Component “y”: Numeric: lengths (12, 14) differ"

## the order of the first mz values changed
## and there is also a > 20 ppm match in row 4
## ppm(300.01, 20) == 0.006; 300.01 + 0.006 = 300.016 < 300.02
graphPeaks(x, y, ppm = 20, FUN = .dotproduct)
# $x
#           mz intensity
# [1,] 100.002         9
# [2,] 100.001         1
# [3,]      NA         0
# [4,] 300.010         1
# [5,] 300.020         9
# [6,]      NA         0
#
# $y
#            mz intensity
# [1,] 100.0000         9
# [2,]       NA         0
# [3,] 200.0000         1
# [4,] 300.0020         1
# [5,] 300.0250         9
# [6,] 300.0255         1

joinGraph(x, y, ppm = 20, FUN = .dotproduct)
# $x
# [1]  1  2 NA NA  3  4 NA
#
# $y
# [1] NA  1  2  3 NA  4  5

.joinGraph(x, y, ppm = 20, FUN = .dotproduct)
# $x
#           mz intensity
# [1,] 100.001         1
# [2,] 100.002         9
# [3,]      NA        NA
# [4,]      NA        NA
# [5,] 300.010         1
# [6,] 300.020         9
# [7,]      NA        NA
#
# $y
#            mz intensity
# [1,]       NA        NA
# [2,] 100.0000         9
# [3,] 200.0000         1
# [4,] 300.0020         1
# [5,]       NA        NA
# [6,] 300.0250         9
# [7,] 300.0255         1

Because of less memory allocation and vectorisation it is faster.

library("microbenchmark")
microbenchmark(
    graphPeaks(x, y, ppm = 20, FUN = .dotproduct),
    .joinGraph(x, y, ppm = 20, FUN = .dotproduct)
)
# Unit: microseconds
#                                           expr      min       lq      mean
#  graphPeaks(x, y, ppm = 20, FUN = .dotproduct) 5860.683 6124.560 6459.0414
#  .joinGraph(x, y, ppm = 20, FUN = .dotproduct)  591.393  630.318  747.0722
#    median       uq       max neval
#  6232.061 6415.744 10155.622   100
#   693.135  733.113  3302.426   100

It also works with real (moderate small) spectra (depending on the number of possible combinations):

## prepare some real example data
library("mzR")
f <- msdata::proteomics("01.mzML.gz", full.names=TRUE)

ms <- openMSfile(f)
on.exit(close(ms))

p <- lapply(peaks(ms)[c(1, 41, 2, 3)], function(x) {
    colnames(x) <- c("mz", "intensity")
    x
})

lengths(p)
# [1] 87054 89584   422   248

system.time(graphPeaks(p[[1]], p[[2]], ppm = 20))
# Error: cannot allocate vector of size 58.1 Gb

system.time(joinGraph(p[[1]], p[[2]], ppm = 20))
# Error: too many combinations

system.time(graphPeaks(p[[3]], p[[4]], ppm = 20))
# Error: cannot allocate vector of size 37.3 Gb

system.time(joinGraph(p[[3]], p[[4]], ppm = 20))
#    user  system elapsed
#   0.001   0.000   0.001

But it is much much slower than the other simple join commands (as expected):

microbenchmark( 
    join(p[[3]][,1], p[[4]][,1], tol=0.2, ppm=0, type = "outer"), 
    joinGraph(p[[3]], p[[4]], tol=0.2, ppm=0), 
    times = 10
)
# Unit: microseconds
#                                                                expr         min
#  join(p[[3]][, 1], p[[4]][, 1], tol = 0.2, ppm = 0, type = "outer")     156.724
#                       joinGraph(p[[3]], p[[4]], tol = 0.2, ppm = 0) 1975215.020
#           lq         mean       median          uq         max neval
#      177.958     258.2854     266.6545     330.403     387.131    10
#  2105836.670 2256661.7154 2203902.8875 2503019.513 2548014.391    10

Things to discuss:

  • Add a type = "graph" to join and remove joinGraph
    (this would require that the input arguments x and y are always matrices instead of numerics for join)
  • Often it would be impossible/impractical to test all possible combinations. Maybe it would be useful to divide the spectra in parts to reduce the number of combinations.
  • Provide more similarity methods (dotproduct, mindiff, ...)
  • Add a dotproduct2 or .dotproduct that doesn't run all these checks to save run time.

@sgibb sgibb added the enhancement New feature or request label Oct 25, 2019
@sgibb sgibb requested a review from jorainer October 25, 2019 10:06
@sgibb sgibb self-assigned this Oct 25, 2019
@sgibb sgibb requested a review from tnaake October 26, 2019 05:38
@jorainer
Copy link
Member

Thanks for the PR @sgibb . Just some quick response to the discussions:

Add a type = "graph" to join and remove joinGraph
(this would require that the input arguments x and y are always matrices instead of numerics for join)

I would keep the functions separate, join should work with numeric to not restrict it to the use with peak matrices. I think there might be other application scenarios too.

@tnaake
Copy link
Collaborator

tnaake commented Nov 7, 2019

Dear @sgibb

thank you for inviting for the review (and sorry for late reply). Indeed I also think that graphPeaks is memory-consuming since we are dealing with huge adjacency matrices. Great idea to use an edge list instead.

It is the first time to review code, so I do not know how to use the GitHub-built in tools (I will not use them for the start).

I have two points for the beginning:

  1. I am not sure about the following
## and there is also a > 20 ppm match in row 4
## ppm(300.01, 20) == 0.006; 300.01 + 0.006 = 300.016 < 300.02
graphPeaks(x, y, ppm = 20, FUN = .dotproduct)
# $x
#           mz intensity
# [1,] 100.002         9
# [2,] 100.001         1
# [3,]      NA         0
# [4,] 300.010         1
# [5,] 300.020         9
# [6,]      NA         0
#
# $y
#            mz intensity
# [1,] 100.0000         9
# [2,]       NA         0
# [3,] 200.0000         1
# [4,] 300.0020         1
# [5,] 300.0250         9
# [6,] 300.0255         1

In my opinion ppm should be applied to both m/z values

ppm(300.01, 20) ## 0.0060002
ppm(300.02, 20) ## 0.0060004

Thus the "true" value is 300.01+-0.006 and 300.02+-0.006 with an overlap between 300.014-300.016. Thus, there is a joint peak reported.

  1. If I run the following using .joinPeaks:
x <- matrix(c(c(100.001, 100.0015, 100.002, 300.01, 300.02),
    c(1, 3, 3, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(x) <- c("mz", "intensity")
 
y <- matrix(c(c(100.0, 200.0, 300.002, 300.025, 300.0255),
     c(1, 1, 1, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(y) <- c("mz", "intensity")

.joinGraph(x, y, ppm = 20)

I get the following error:

Error in .transposeList(l) : 
  'e' has to be a list with elements of equal length.

On the contrary

x <- matrix(c(c(100.001, 100.002, 300.01, 300.02),
    c(1, 3, 1, 1)), ncol = 2, nrow = 4, byrow = FALSE)
colnames(x) <- c("mz", "intensity")
 
y <- matrix(c(c(100.0, 200.0, 300.002, 300.025, 300.0255),
     c(1, 1, 1, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(y) <- c("mz", "intensity")

.joinGraph(x, y, ppm = 20)

works fine

Any ideas on that?

@sgibb
Copy link
Member Author

sgibb commented Nov 18, 2019

@tnaake thanks for looking into this

  1. In my opinion ppm should be applied to both m/z values

I understand your point. I think this isn't a real problem. We have to document how ppm is defined.
My interpretation is 300.02 - 300.01 > ppm(300.01, 20) and yours is 300.02 - 300.01 < sum(ppm(c(300.01, 300.02), 20)).

@jorainer and @lgatto what is your opinion here?

BTW: I use closest internally which don't allow a tolerance/ppm error for x and table but the behaviour that @tnaake suggests should be more or less be possible with clostest(x, table, ppm = 2 * ppm) (of course not exactly because sum(ppm(c(300.01, 300.02), 20)) > 2 * ppm(300.01, 20) but we could use closest(x, table, ppm = 2.5 * ppm) and test all matches against diff(singleMatchMzs) < sum(ppm(singleMatchMzs, 20)) subsequently.

  1. That are indeed two bugs. The first was in .combinations that I fixed with the latest commit. The second one is in .edgeGroups that isn't working for this specific case. I added a unit test with your example but have no time to fix it yet (the .edgeGroups function is the most important and tricky one, so if anyone has a good solution I am happy to hear it).

@sgibb
Copy link
Member Author

sgibb commented Nov 18, 2019

@jorainer : regarding the common interface. I would like to use two matrices as input and provide a common interface for all joining functions. You use matrices in Spectra::joinPeaks anyway:

https://github.com/rformassspectrometry/Spectra/blob/1ff21ac05fbe531433c116b563ddba2cf1517382/R/peaks-functions.R#L213-L216

And I guess in Chromatograms we would have matrices as input, too.

@sgibb
Copy link
Member Author

sgibb commented Nov 18, 2019

@tnaake now I fixed the .edgeGroup function (but it is not vectorized anymore, any input to vectorize it again is welcome!)

Now I got:

x <- matrix(c(c(100.001, 100.0015, 100.002, 300.01, 300.02),
    c(1, 3, 3, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(x) <- c("mz", "intensity")

y <- matrix(c(c(100.0, 200.0, 300.002, 300.025, 300.0255),
     c(1, 1, 1, 1, 1)), ncol = 2, nrow = 5, byrow = FALSE)
colnames(y) <- c("mz", "intensity")


joinGraph(x, y, ppm = 20)
# $x
# [1]  1  2  3 NA NA  4  5 NA
#
# $y
# [1] NA  1 NA  2  3 NA  4  5
#
mindiff <- function(x, y)-sum(abs(x[, 2L] - y[, 2L]), na.rm = TRUE)
joinGraph(x, y, ppm = 20, FUN = mindiff)
# $x
# [1]  1  2  3 NA NA  4  5 NA
#
# $y
# [1]  1 NA NA  2  3 NA  4  5
#

.joinGraph(x, y, ppm = 20)
# $x
#            mz intensity
# [1,] 100.0010         1
# [2,] 100.0015         3
# [3,] 100.0020         3
# [4,]       NA        NA
# [5,]       NA        NA
# [6,] 300.0100         1
# [7,] 300.0200         1
# [8,]       NA        NA
# 
# $y
#            mz intensity
# [1,]       NA        NA
# [2,] 100.0000         1
# [3,]       NA        NA
# [4,] 200.0000         1
# [5,] 300.0020         1
# [6,]       NA        NA
# [7,] 300.0250         1
# [8,] 300.0255         1
# 
.joinGraph(x, y, ppm = 20, FUN = mindiff)
# $x
#            mz intensity
# [1,] 100.0010         1
# [2,] 100.0015         3
# [3,] 100.0020         3
# [4,]       NA        NA
# [5,]       NA        NA
# [6,] 300.0100         1
# [7,] 300.0200         1
# [8,]       NA        NA
# 
# $y
#            mz intensity
# [1,] 100.0000         1
# [2,]       NA        NA
# [3,]       NA        NA
# [4,] 200.0000         1
# [5,] 300.0020         1
# [6,]       NA        NA
# [7,] 300.0250         1
# [8,] 300.0255         1
# 

@codecov-io
Copy link

codecov-io commented Nov 18, 2019

Codecov Report

Merging #25 into master will decrease coverage by 0.25%.
The diff coverage is 98.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #25      +/-   ##
==========================================
- Coverage     100%   99.74%   -0.26%     
==========================================
  Files          20       22       +2     
  Lines         306      392      +86     
==========================================
+ Hits          306      391      +85     
- Misses          0        1       +1
Impacted Files Coverage Δ
R/utils.R 100% <100%> (ø) ⬆️
R/matching.R 100% <100%> (ø) ⬆️
R/joinGraph.R 98.14% <98.14%> (ø)
R/group.R 100% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc0363b...0a72dad. Read the comment docs.

@jorainer
Copy link
Member

jorainer commented Dec 6, 2019

@sgibb , sorry for the late reply. Regarding the use of matrices, if we have to use matrices then go ahead and use them. But you're right, then we have to use matrices also for all other joining functions. I just wanted to keep them as generic as possible to allow also their use if you have only numeric vectors. Maybe let's discuss in detail at the EuroBioc2019.

@tnaake
Copy link
Collaborator

tnaake commented Dec 30, 2019

Hi @sgibb,

many thanks for looking into this.

Here are some comments:

  1. in joinPeaks the function validPeaksMatrix is applied every time for each object x and y

validPeaksMatrix(x)
validPeaksMatrix(y)

It is only a small check, but I am wondering if there can be a upstream function for checking the validity of spectra matrices, i.e. checking them on a list of matrices instead of matrices separately. Running validPeaksMatrix on 1000 spectra causes n**2 checks, while running this on the list only requires n checks.

x <- matrix(c(c(100.001, 100.002, 300.01, 300.02),
    c(1, 3, 1, 1)), ncol = 2, nrow = 4, byrow = FALSE)
colnames(x) <- c("mz", "intensity")
microbenchmark(validPeaksMatrix(x))
Unit: microseconds
               expr   min    lq    mean median    uq    max neval
 validPeaksMatrix(x) 7.242 7.534 8.13696  7.621 7.759 40.277   10

For 1000 spectra, this will result in 8 s or 0.008 s checks, respectively. It is not a lot of time I would say, but probably can be improved.

  1. Given the arguments:
    x <- matrix(c(c(100.0011, 100.0011),
    c(1, 1)), ncol = 2, nrow = 2, byrow = FALSE)
    colnames(x) <- c("mz", "intensity")

I receive the following output:

joinGraph(x, x)
$x
[1] 1 2 1
$y
[1] 1 NA 2

and for .joinGraph

.joinGraph(x, x)
$x
mz intensity
[1,] 100.0011 1
[2,] 100.0011 1
[3,] 100.0011 1
$y
mz intensity
[1,] 100.0011 1
[2,] NA NA
[3,] 100.0011 1

Why one entry is duplicated, i.e. three peaks are returned instead of two for $x? Probably duplicated peaks will not happen when working with real spectra, but it appears to be a bug.

  1. Concerning the ppm it should be changed that ppm is applied to all peaks, instead of only to one (see above). Since the MS cannot determine the exact m/z on only one, but always on two peaks. @jorainer and @lgatto any comments on this?

  2. @sgibb could you add some inline documentation for the function closest in .edgeList? It is a bit hard for me to understand what is going on in this function. Especially what is going on with the function findInterval

  3. In .combinations add space between / and characters before and after for

ncmb <- ncmb/n

@jorainer
Copy link
Member

jorainer commented Jan 7, 2020

Concerning the ppm it should be changed that ppm is applied to all peaks, instead of only to one (see above). Since the MS cannot determine the exact m/z on only one, but always on two peaks.

@tnaake, could you elaborate a little on this? What exactly do you mean here? Changing the ppm function? As far as I know the ppm is a relative error on a single m/z value (or at least that is what I have been told).

@tnaake
Copy link
Collaborator

tnaake commented Jan 8, 2020

Dear @jorainer
sure. There is no need to change the ppm function. ppm is a random relative error on m/z values that is given by the accuracy of the mass spectrometer. Using the joinGraphfunction above, we would like to align the spectra of two spectra. In order to do this we need to give to the function a ppm value that tells joinGraph how accurately the spectra were obtained . Thus, for two peaks from two spectra given a certain difference the two peaks will be aligned or not. For example, should the peak with m/z 100.010 of spectrum 1 and the peak with m/z value 100.011 of spectrum 2 be regarded as originating from the same theoretical m/z value.

The question now is to which peak(s) do we apply the ppm function? Only on peaks of one of the spectra (if I am right this is how it is currently implemented) or on peaks of all spectra?

In my point of view, the ppm function should be applied to all peaks in all spectra, since the error operated on all obtained peaks. This will result in a lower and upper value of possible 'true' m/z values per peak in all spectra that can be aligned by joinGraph. If we use ppm only on peaks of one spectra this would mean that for the other peaks of the second spectra, we are sure that these are the true/accurate values, which is not true.

I hope it is a bit clearer what I meant by this comment now.

@jorainer
Copy link
Member

Thanks @tnaake for the explanation. I completely agree with you. The ppm should be applied to all (both) spectra in the alignment - and I thought we have implemented it that way. I will cross check.

@jorainer
Copy link
Member

I had a look at the joinPeaks function, and also there we are applying the ppm only once. I see here two solutions to the problem:

  1. we only apply ppm once (i.e. to one spectrum) as it is now but clearly describe this in the documentation.

  2. we use 2 * ppm in the comparison (e.g. in line https://github.com/rformassspectrometry/MsCoreUtils/blob/master/R/matching.R#L105 of joinPeaks). This is not exactly the same as applying ppm to the m/z values of spectrum 1 and to the m/z values of spectrum 2, but should be close enough (which I would accept because applying ppm to each spectrum separately will have a bigger impact on performance). Again, we have to clearly state that in the documentation.

Regarding the tolerance parameter, I would apply that one only once (as it is supposed to be an absolute, constant difference, while the ppm is supposed to be a m/z relative measurement error of the machine).

@sgibb , @lgatto and @tnaake what are your thoughts about that?

@tnaake
Copy link
Collaborator

tnaake commented Jan 20, 2020

I am voting for the 2nd solution. The error should be minimal. I agree that this should be properly documented.

Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All OK from my side. The problem with travis is strange, especially because in the jomaster branch I don't have it (yet?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants