Modify EL batching to doc-wise streaming approach #12367

Merged: 109 commits, Apr 9, 2024
Changes from 97 commits
Commits
cd98ab4
Convert Candidate from Cython to Python class.
rmitsch Feb 28, 2023
5a9d8ba
Format.
rmitsch Feb 28, 2023
a97ef65
Fix .entity_ typo in _add_activations() usage.
rmitsch Feb 28, 2023
8596fb8
Change type for mentions to look up entity candidates for to SpanGrou…
rmitsch Feb 28, 2023
50b3475
Update docs.
rmitsch Feb 28, 2023
0680958
Update spacy/kb/candidate.py
rmitsch Mar 1, 2023
3da0712
Update doc string of BaseCandidate.__init__().
rmitsch Mar 1, 2023
21fa22d
Merge branch 'refactor/el-candidates' of github.com:rmitsch/spaCy int…
rmitsch Mar 1, 2023
417e8fe
Update spacy/kb/candidate.py
rmitsch Mar 1, 2023
49abf4f
Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
rmitsch Mar 1, 2023
fa39061
Adjust Candidate to support and mandate numerical entity IDs.
rmitsch Mar 1, 2023
257bca3
Format.
rmitsch Mar 1, 2023
9bd498c
Fix docstring and docs.
rmitsch Mar 1, 2023
3beda2b
Merge branch 'refactor/el-candidates' into refactor/span-group-for-me…
rmitsch Mar 3, 2023
61bacf8
Update website/docs/api/kb.mdx
rmitsch Mar 3, 2023
46fe069
Rename alias -> mention.
rmitsch Mar 3, 2023
94e57d0
Refactor Candidate attribute names. Update docs and tests accordingly.
rmitsch Mar 3, 2023
38dce96
Refactor Candidate attributes and their usage.
rmitsch Mar 5, 2023
5f40b3e
Format.
rmitsch Mar 5, 2023
670e1ca
Fix mypy error.
rmitsch Mar 5, 2023
2ac586f
Update error code in line with v4 convention.
rmitsch Mar 5, 2023
bb7418e
Modify EL batching system.
rmitsch Mar 3, 2023
e4e55b8
Update leftover get_candidates() mention in docs.
rmitsch Mar 6, 2023
f33f0ed
Merge branch 'v4' into feature/docwise-generator-batching
rmitsch Mar 6, 2023
8b24f31
Format docs.
rmitsch Mar 6, 2023
d0abc32
Format.
rmitsch Mar 6, 2023
4bdb359
Merge branch 'v4' into feature/docwise-generator-batching
rmitsch Mar 7, 2023
8dbb74c
Merge branch 'v4' into refactor/el-candidates
rmitsch Mar 7, 2023
082992a
Update spacy/kb/candidate.py
rmitsch Mar 7, 2023
f8a02f7
Updated error code.
rmitsch Mar 7, 2023
0c63940
Merge branch 'v4' into refactor/el-candidates
rmitsch Mar 7, 2023
cea58ad
Simplify interface for int/str representations.
rmitsch Mar 7, 2023
1ba2fc4
Update website/docs/api/kb.mdx
rmitsch Mar 9, 2023
1c937db
Rename 'alias' to 'mention'.
rmitsch Mar 9, 2023
b476041
Port Candidate and InMemoryCandidate to Cython.
rmitsch Mar 9, 2023
845864b
Remove redundant entry in setup.py.
rmitsch Mar 9, 2023
b0ee341
Add abstract class check.
rmitsch Mar 9, 2023
c61654e
Drop storing mention.
rmitsch Mar 9, 2023
34e092e
Update spacy/kb/candidate.pxd
rmitsch Mar 9, 2023
6fc7997
Fix entity_id refactoring problems in docstrings.
rmitsch Mar 10, 2023
2705391
Drop unused InMemoryCandidate._entity_hash.
rmitsch Mar 10, 2023
348dd1c
Update docstrings.
rmitsch Mar 10, 2023
ce23942
Merge branch 'refactor/el-candidates' of github.com:rmitsch/spaCy int…
rmitsch Mar 10, 2023
649c146
Move attributes out of Candidate.
rmitsch Mar 13, 2023
6adc151
Partially fix alias/mention terminology usage. Convert Candidate to i…
rmitsch Mar 13, 2023
4a92176
Remove prior_prob from supported properties in Candidate. Introduce K…
rmitsch Mar 13, 2023
be85898
Update docstrings related to prior_prob.
rmitsch Mar 13, 2023
28dbed6
Update alias/mention usage in doc(strings).
rmitsch Mar 14, 2023
b7b4282
Update spacy/ml/models/entity_linker.py
rmitsch Mar 15, 2023
961795d
Update spacy/ml/models/entity_linker.py
rmitsch Mar 15, 2023
3cfc1c6
Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLo…
rmitsch Mar 15, 2023
80fb066
Update docstrings.
rmitsch Mar 15, 2023
830939e
Fix InMemoryCandidate attribute names.
rmitsch Mar 15, 2023
978fbdc
Update spacy/kb/kb.pyx
rmitsch Mar 17, 2023
307bbab
Update spacy/ml/models/entity_linker.py
rmitsch Mar 17, 2023
2377b67
Update W401 test.
rmitsch Mar 17, 2023
4d8dce5
Update spacy/errors.py
rmitsch Mar 17, 2023
faede71
Update spacy/kb/kb.pyx
rmitsch Mar 17, 2023
9e71adc
Use Candidate output type for toy generators in the test suite to mim…
svlandeg Mar 19, 2023
0365d3d
fix docs
svlandeg Mar 19, 2023
b834073
fix import
svlandeg Mar 19, 2023
73bdeb0
Merge branch 'refactor/el-candidates' into feature/docwise-generator-…
rmitsch Mar 20, 2023
cb79af3
Fix merge leftovers.
rmitsch Mar 20, 2023
e5be5d6
Merge branch 'v4' into feature/docwise-generator-batching
rmitsch Mar 20, 2023
4974769
Merge branch 'v4' into feature/docwise-generator-batching
rmitsch Apr 17, 2023
571eaf6
Update spacy/kb/kb.pyx
rmitsch Apr 24, 2023
9b677ad
Update spacy/kb/kb.pyx
rmitsch Apr 24, 2023
3ae31f7
Update website/docs/api/kb.mdx
rmitsch Apr 24, 2023
fb79b52
Update website/docs/api/entitylinker.mdx
rmitsch Apr 24, 2023
10ddefa
Update spacy/kb/kb_in_memory.pyx
rmitsch Apr 24, 2023
1ece9ec
Update website/docs/api/inmemorylookupkb.mdx
rmitsch Apr 24, 2023
cfbb4a5
Update get_candidates() docstring.
rmitsch Apr 24, 2023
7aa3758
Reformat imports in entity_linker.py.
rmitsch Apr 24, 2023
40e3aca
Merge branch 'v4' into feature/docwise-generator-batching
rmitsch Apr 24, 2023
ee5d7f4
Drop valid_ent_idx_per_doc.
rmitsch Apr 24, 2023
638103e
Update docs.
rmitsch Apr 24, 2023
2c80db9
Format.
rmitsch Apr 24, 2023
d1371d1
Simplify doc loop in predict().
rmitsch Apr 24, 2023
c655b36
Remove E1044 comment.
rmitsch Apr 25, 2023
8aa59c4
Merge branch 'v4' into feature/docwise-generator-batching
rmitsch Jul 27, 2023
a258533
Fix merge errors.
rmitsch Jul 27, 2023
61b2215
Format.
rmitsch Jul 27, 2023
aca4ada
Format.
rmitsch Jul 27, 2023
5bad3d2
Format.
rmitsch Jul 27, 2023
645b525
Fix merge error & tests.
rmitsch Jul 28, 2023
25bce73
Format.
rmitsch Jul 28, 2023
78c72d3
Merge branch 'main' into feature/docwise-generator-batching
rmitsch Jan 30, 2024
c8691a2
Apply suggestions from code review
rmitsch Jan 30, 2024
f169614
Use type alias.
rmitsch Feb 1, 2024
c174ebf
isort.
rmitsch Feb 1, 2024
d778da3
isort.
rmitsch Feb 1, 2024
aa87845
Lint.
rmitsch Feb 1, 2024
1d2994a
Add typedefs.pyx.
rmitsch Feb 1, 2024
4c7bd30
Fix typedef import.
rmitsch Feb 1, 2024
7d6ae1b
Fix type aliases.
rmitsch Feb 1, 2024
6401856
Format.
rmitsch Feb 1, 2024
af336ac
Merge branch 'upstream_main' into feature/docwise-generator-batching
svlandeg Feb 6, 2024
d6c7636
Update docstring and type usage.
rmitsch Feb 7, 2024
8a2a7f1
Merge branch 'feature/docwise-generator-batching' of github.com:rmits…
rmitsch Feb 7, 2024
5f87b6a
Add info on get_candidates(), get_candidates_batched().
rmitsch Feb 7, 2024
5d1ecf1
Readd get_candidates info to v3 changelog.
rmitsch Feb 7, 2024
c4d4926
Update website/docs/api/entitylinker.mdx
rmitsch Feb 19, 2024
2951c19
Update factory functions for backwards compatibility.
rmitsch Feb 19, 2024
ca1f86e
Merge branch 'feature/docwise-generator-batching' of github.com:rmits…
rmitsch Feb 19, 2024
79798c0
Format.
rmitsch Feb 19, 2024
c187b13
Ignore mypy error.
rmitsch Feb 19, 2024
9391de6
Fix mypy error.
rmitsch Feb 19, 2024
e83a988
Format.
rmitsch Feb 19, 2024
eef3de0
Add test for multiple docs with multiple entities.
rmitsch Feb 19, 2024
1 change: 0 additions & 1 deletion spacy/errors.py
@@ -950,7 +950,6 @@ class Errors(metaclass=ErrorsWithCodes):
              "case pass an empty list for the previously not specified argument to avoid this error.")
     E1043 = ("Expected None or a value in range [{range_start}, {range_end}] for entity linker threshold, but got "
              "{value}.")
-    E1044 = ("Expected `candidates_batch_size` to be >= 1, but got: {value}")
     E1045 = ("Encountered {parent} subclass without `{parent}.{method}` "
              "method in '{name}'. If you want to use this method, make "
              "sure it's overwritten on the subclass.")
27 changes: 8 additions & 19 deletions spacy/kb/kb.pyx
@@ -1,14 +1,14 @@
 # cython: infer_types=True

 from pathlib import Path
-from typing import Iterable, Tuple, Union
+from typing import Iterable, Iterator, Tuple, Union

 from cymem.cymem cimport Pool

 from ..errors import Errors
-from ..tokens import Span, SpanGroup
+from ..tokens import SpanGroup
 from ..util import SimpleFrozenList
-from .candidate import Candidate
+from .candidate cimport Candidate


 cdef class KnowledgeBase:
@@ -19,6 +19,8 @@ cdef class KnowledgeBase:

     DOCS: https://spacy.io/api/kb
     """
+    CandidatesForMentionT = Iterable[Candidate]
+    CandidatesForDocT = Iterable[CandidatesForMentionT]

     def __init__(self, vocab: Vocab, entity_vector_length: int):
         """Create a KnowledgeBase."""
@@ -32,27 +34,14 @@ cdef class KnowledgeBase:
         self.entity_vector_length = entity_vector_length
         self.mem = Pool()

-    def get_candidates_batch(
-        self, mentions: SpanGroup
-    ) -> Iterable[Iterable[Candidate]]:
+    def get_candidates(self, mentions: Iterator[SpanGroup]) -> Iterator[CandidatesForDocT]:
         """
         Return candidate entities for a specified Span mention. Each candidate defines at least the entity and the

Review comments on the line above:
Member: Docstring needs to be updated
Contributor Author: Updated in cfbb4a5.
Member: This still requires updating - the text refers to a single specified Span Mention.

         entity's embedding vector. Depending on the KB implementation, further properties - such as the prior
         probability of the specified mention text resolving to that entity - might be included.
         If no candidates are found for a given mention, an empty list is returned.
-        mentions (SpanGroup): Mentions for which to get candidates.
-        RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
-        """
-        return [self.get_candidates(span) for span in mentions]
-
-    def get_candidates(self, mention: Span) -> Iterable[Candidate]:
-        """
-        Return candidate entities for a specific mention. Each candidate defines at least the entity and the
-        entity's embedding vector. Depending on the KB implementation, further properties - such as the prior
-        probability of the specified mention text resolving to that entity - might be included.
-        If no candidate is found for the given mention, an empty list is returned.
-        mention (Span): Mention for which to get candidates.
-        RETURNS (Iterable[Candidate]): Identified candidates.
+        mentions (Iterator[SpanGroup]): Mentions for which to get candidates.
+        RETURNS (Iterator[Iterable[Iterable[Candidate]]]): Identified candidates per mention/doc/doc batch.
         """
         raise NotImplementedError(
             Errors.E1045.format(
9 changes: 5 additions & 4 deletions spacy/kb/kb_in_memory.pyx
@@ -1,5 +1,5 @@
 # cython: infer_types=True
-from typing import Any, Callable, Dict, Iterable
+from typing import Any, Callable, Dict, Iterable, Iterator

 import srsly

@@ -12,7 +12,7 @@ from preshed.maps cimport PreshMap
 import warnings
 from pathlib import Path

-from ..tokens import Span
+from ..tokens import SpanGroup

 from ..typedefs cimport hash_t

@@ -255,8 +255,9 @@ cdef class InMemoryLookupKB(KnowledgeBase):
         alias_entry.probs = probs
         self._aliases_table[alias_index] = alias_entry

-    def get_candidates(self, mention: Span) -> Iterable[InMemoryCandidate]:
-        return self._get_alias_candidates(mention.text) # type: ignore
+    def get_candidates(self, mentions: Iterator[SpanGroup]) -> Iterator[Iterable[Iterable[InMemoryCandidate]]]:
+        for mentions_for_doc in mentions:
+            yield [self._get_alias_candidates(span.text) for span in mentions_for_doc]

     def _get_alias_candidates(self, str alias) -> Iterable[InMemoryCandidate]:
         """
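
For readers skimming the change to `InMemoryLookupKB.get_candidates()`, the doc-wise streaming contract can be illustrated without spaCy. The following is a minimal sketch, not the library's implementation: `ToyCandidate`, `ToyLookupKB`, and the entity IDs are hypothetical stand-ins, and plain strings replace `Span` objects. It only shows the shape of the new interface: an iterator of per-doc mention groups goes in, and one list of per-mention candidate lists is yielded per doc.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class ToyCandidate:
    """Hypothetical stand-in for a KB candidate (not spaCy's Candidate class)."""

    entity_id: str
    prior_prob: float


class ToyLookupKB:
    """Hypothetical alias-lookup KB that mimics the streaming contract."""

    def __init__(self, aliases: Dict[str, List[ToyCandidate]]):
        self._aliases = aliases

    def get_candidates(
        self, mentions: Iterator[Iterable[str]]
    ) -> Iterator[List[List[ToyCandidate]]]:
        # One yield per doc. No lookup happens for a doc until the caller
        # requests that doc's candidates from the generator.
        for mentions_for_doc in mentions:
            # Unknown mentions map to an empty candidate list.
            yield [self._aliases.get(text, []) for text in mentions_for_doc]


kb = ToyLookupKB(
    {"Douglas": [ToyCandidate("ent_a", 0.8), ToyCandidate("ent_b", 0.2)]}
)
# Three docs: two mentions, no mentions, one mention.
docs_mentions = iter([["Douglas", "Arthur"], [], ["Douglas"]])
stream = kb.get_candidates(docs_mentions)
first_doc = next(stream)  # candidates for the first doc only
```

Because `get_candidates()` is a generator here, a doc with no mentions still yields an (empty) entry, so the output stays aligned one-to-one with the input docs.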
38 changes: 12 additions & 26 deletions spacy/ml/models/entity_linker.py
@@ -1,5 +1,5 @@
 from pathlib import Path
-from typing import Callable, Iterable, List, Optional, Tuple
+from typing import Callable, Iterable, Iterator, List, Optional, Tuple

 from thinc.api import (
     Linear,
@@ -117,34 +117,20 @@ def empty_kb_factory(vocab: Vocab):


 @registry.misc("spacy.CandidateGenerator.v1")
-def create_candidates() -> Callable[[KnowledgeBase, Span], Iterable[Candidate]]:
-    return get_candidates
-
-
-@registry.misc("spacy.CandidateBatchGenerator.v1")
(svlandeg marked this conversation as resolved.)
-def create_candidates_batch() -> Callable[
-    [KnowledgeBase, SpanGroup], Iterable[Iterable[Candidate]]
+def create_get_candidates() -> Callable[
+    [KnowledgeBase, Iterator[SpanGroup]],
+    Iterator[Iterable[Iterable[Candidate]]],
(rmitsch marked this conversation as resolved.)
 ]:
-    return get_candidates_batch
-
-
-def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]:
-    """
-    Return candidate entities for a given mention and fetching appropriate entries from the index.
-    kb (KnowledgeBase): Knowledge base to query.
-    mention (Span): Entity mention for which to identify candidates.
-    RETURNS (Iterable[Candidate]): Identified candidates.
-    """
-    return kb.get_candidates(mention)
+    return get_candidates


-def get_candidates_batch(
-    kb: KnowledgeBase, mentions: SpanGroup
-) -> Iterable[Iterable[Candidate]]:
+def get_candidates(
+    kb: KnowledgeBase, mentions: Iterator[SpanGroup]
+) -> Iterator[Iterable[Iterable[Candidate]]]:
     """
-    Return candidate entities for the given mentions and fetching appropriate entries from the index.
+    Return candidate entities for the given mentions from the KB.
     kb (KnowledgeBase): Knowledge base to query.
-    mentions (SpanGroup): Entity mentions for which to identify candidates.
-    RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
+    mentions (Iterator[SpanGroup]): Mentions per doc.
+    RETURNS (Iterator[Iterable[Iterable[Candidate]]]): Identified candidates per mentions in document/SpanGroup.
     """
-    return kb.get_candidates_batch(mentions)
+    return kb.get_candidates(mentions)
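
The unified `(kb, mentions) -> iterator` signature also means a custom candidate generator can wrap the KB's stream instead of replacing it. The sketch below is an assumption-laden illustration in plain Python, not spaCy code: `ToyKB`, `create_top_k_candidates`, the `(entity_id, prior_prob)` tuples, and the entity IDs are all hypothetical, and no registry decorator is used. It keeps the top `k` candidates per mention while preserving the one-entry-per-doc streaming behaviour.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# (entity_id, prior_prob) tuples stand in for candidate objects.
ToyCandidate = Tuple[str, float]
CandidatesForDoc = List[List[ToyCandidate]]
MentionsPerDoc = Iterator[Iterable[str]]


class ToyKB:
    """Hypothetical KB exposing a doc-wise streaming get_candidates()."""

    def __init__(self, table: Dict[str, List[ToyCandidate]]):
        self._table = table

    def get_candidates(self, mentions: MentionsPerDoc) -> Iterator[CandidatesForDoc]:
        for mentions_for_doc in mentions:
            yield [list(self._table.get(text, [])) for text in mentions_for_doc]


def create_top_k_candidates(
    k: int,
) -> Callable[[ToyKB, MentionsPerDoc], Iterator[CandidatesForDoc]]:
    """Build a generator function that keeps the k most probable candidates."""

    def get_top_k(kb: ToyKB, mentions: MentionsPerDoc) -> Iterator[CandidatesForDoc]:
        # Wrap the KB's stream lazily: one filtered entry per doc.
        for candidates_for_doc in kb.get_candidates(mentions):
            yield [
                sorted(cands, key=lambda c: c[1], reverse=True)[:k]
                for cands in candidates_for_doc
            ]

    return get_top_k


kb = ToyKB(
    {"Paris": [("ent_b", 0.1), ("ent_a", 0.9)], "Texas": [("ent_c", 1.0)]}
)
docs = [["Paris", "Texas"], ["Paris"]]
get_top_1 = create_top_k_candidates(k=1)
# The candidate stream lines up with the docs, so zip() pairs them directly.
results = list(zip(docs, get_top_1(kb, iter(docs))))
```

Pairing with `zip()` works only because the stream yields exactly one entry per input doc, which is the property the doc-wise interface is meant to guarantee.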