[Book 1] Ch.1 potential typo: Normalizing TF-IDF per row vs column #1115

Open
yi-jenc opened this issue Apr 23, 2024 · 0 comments

On p. 25 of Book 1 (in the latest version available online, dated June 2023), Section 1.5.4.2 states that we often normalize each row of the TF-IDF matrix. According to the definition of TF-IDF in the book, $(\text{TF-IDF})_{ij}$ refers to the frequency of the $i$-th term in the $j$-th document, so rows index terms and columns index documents. Normalizing each row therefore puts (the occurrences of) all the words on the same scale.

I wonder whether we actually want to normalize each column of the TF-IDF matrix instead of each row. That would put all the documents on the same scale, regardless of their lengths.
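To make the difference concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy matrix is made up, raw counts stand in for the TF-IDF weights, and the helper names `l2_normalize_rows` and `l2_normalize_columns` are hypothetical. Rows are terms and columns are documents, matching the indexing above. The second document is just the first one repeated three times.

```python
import math

# Toy term-by-document matrix (made-up data): rows are terms, columns are documents.
# Raw counts stand in for TF-IDF weights. Document 1 is document 0 repeated 3 times.
counts = [
    [2, 6],  # term 0
    [1, 3],  # term 1
    [0, 0],  # term 2 (absent from both documents)
]


def l2_normalize_rows(matrix):
    """Scale each row (one term across all documents) to unit Euclidean length."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows as zeros to avoid dividing by zero.
        result.append([x / norm if norm > 0 else 0.0 for x in row])
    return result


def l2_normalize_columns(matrix):
    """Scale each column (one document across all terms) to unit Euclidean length."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [
        math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [matrix[i][j] / norms[j] if norms[j] > 0 else 0.0 for j in range(n_cols)]
        for i in range(n_rows)
    ]


def column(matrix, j):
    """Return column j of the matrix as a list (the vector for document j)."""
    return [row[j] for row in matrix]


by_row = l2_normalize_rows(counts)
by_column = l2_normalize_columns(counts)

# After column normalization the two document vectors coincide
# (up to floating-point rounding), so document length no longer matters.
print(column(by_column, 0), column(by_column, 1))

# After row normalization the longer document still has larger entries.
print(column(by_row, 0), column(by_row, 1))
```

In this toy case, column normalization maps both documents to the same unit vector, while row normalization leaves the longer document with entries three times as large.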


Also, there is a minor notational inconsistency in the following Sec. 1.5.4.3. Previously, the size of the vocabulary was denoted by $D$ (as in most of the book), but here it switches to $V$, which is undefined.
