docs: update README to clarify text chunking process and add visual representation of ISCC generation process
titusz committed Aug 19, 2024
1 parent 147e9c4 commit 18933c3
Showing 1 changed file with 17 additions and 1 deletion.
18 changes: 17 additions & 1 deletion README.md
@@ -135,13 +135,29 @@ options:

`iscc-sct` employs the following process:

1. Splits the text into semantically coherent chunks.
1. Splits the text into overlapping chunks (using syntactically sensible breakpoints).
1. Uses a pre-trained deep learning model for text embedding.
1. Generates feature vectors capturing essential characteristics of the chunks.
1. Aggregates these vectors and binarizes them to produce a Semantic Text-Code.
1. Prefixes the binarized vector with the matching ISCC header, encodes it with base32, and adds the
"ISCC:" prefix.
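
The chunking step can be illustrated with a minimal sketch. It is not the `iscc-sct` implementation: it assumes that sentence endings serve as the "syntactically sensible breakpoints", and the function names and the `size` and `overlap` parameters are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' followed by whitespace (assumed breakpoints)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def overlapping_chunks(text: str, size: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences into windows of `size` that share `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    sentences = split_sentences(text)
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks
```

Because neighbouring chunks share sentences, content near a chunk boundary is still embedded together with its context on both sides.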

This process ensures robustness to variations and translations, enabling cross-lingual matching.
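
Matching two binarized codes usually comes down to counting differing bits. The sketch below shows one way to do that with the standard library only. It assumes a two-byte header in front of the binarized vector (`HEADER_BYTES` is an assumption, not taken from this README), and the helper names are hypothetical.

```python
import base64

HEADER_BYTES = 2  # assumption: length of the ISCC header preceding the bit vector


def decode_body(code: str) -> bytes:
    """Strip the 'ISCC:' prefix, decode base32, and drop the assumed header."""
    if code.startswith("ISCC:"):
        code = code[len("ISCC:"):]
    padded = code + "=" * (-len(code) % 8)  # restore stripped base32 padding
    return base64.b32decode(padded)[HEADER_BYTES:]


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(code_a: str, code_b: str) -> float:
    """Return the fraction of matching bits (1.0 means identical bodies)."""
    a, b = decode_body(code_a), decode_body(code_b)
    if not a:
        raise ValueError("codes carry no body bits")
    return 1.0 - hamming_distance(a, b) / (len(a) * 8)
```

A text and its translation would be expected to yield codes with a small Hamming distance; what counts as a match depends on a threshold chosen by the application.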

Here's a visual representation of the ISCC Semantic Text-Code generation process:

```mermaid
graph TD
A[Input Text] --> B[Split into Overlapping Chunks]
B --> C[Create Multilingual Vector Embeddings per Chunk]
C --> D[Calculate Document Vector using Mean Pooling]
D --> E[Binarize Document Vector]
E --> F[Prefix with ISCC Header]
F --> G[Encode with Base32]
G --> H[Prefix with 'ISCC:']
H --> I[Final ISCC Semantic Text-Code]
```
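
The stages after embedding can be sketched in plain Python. This is an illustration under stated assumptions, not the `iscc-sct` implementation: the chunk embeddings are stand-in lists of floats, the sign-based binarization rule is an assumption, `PLACEHOLDER_HEADER` is a dummy value rather than a real ISCC header, and all function names are hypothetical.

```python
import base64
from typing import List, Sequence

# Placeholder only: a real ISCC header encodes type, subtype, version and length.
PLACEHOLDER_HEADER = bytes([0x00, 0x00])


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the chunk embeddings per dimension to get a document vector."""
    if not vectors:
        raise ValueError("need at least one chunk embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def binarize(vector: Sequence[float]) -> bytes:
    """Assumed rule: one bit per dimension, 1 if the value is >= 0, else 0."""
    bits = [1 if x >= 0 else 0 for x in vector]
    bits += [0] * (-len(bits) % 8)  # pad to a whole number of bytes
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit  # most significant bit first
        out.append(byte)
    return bytes(out)


def encode_code(digest: bytes, header: bytes = PLACEHOLDER_HEADER) -> str:
    """Prefix the header, base32-encode without padding, and add 'ISCC:'."""
    body = base64.b32encode(header + digest).decode("ascii").rstrip("=")
    return "ISCC:" + body


def semantic_code(chunk_vectors: Sequence[Sequence[float]]) -> str:
    """Run pooling, binarization and encoding in the order shown in the diagram."""
    return encode_code(binarize(mean_pool(chunk_vectors)))
```

Each function corresponds to one node of the diagram, so a single stage (for example the binarization rule) can be swapped without touching the others.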

## Development and Contributing

We welcome contributions to enhance the capabilities and efficiency of this proof of concept. For