
LDA2Vec doesn't work at all; does anyone have the correct code for python 3? #84

haebichan opened this issue Oct 23, 2018 · 17 comments


@haebichan

LDA2Vec doesn't seem to work at all at its current stage. The Gensim code is outdated, the general code runs on Python 2.7, and people seem to be having problems with Chainer and other dependencies.

I tried to revise the code for Python 3, but I'm hitting walls here and there, especially since I don't know exactly how every function works. Has anyone solved these general issues? Has it actually worked for anyone recently?

@bosulliv

bosulliv commented Oct 31, 2018

It is quite broken, even on Python 2. I spun up a virtualenv and spent an hour trying to wrestle the latest spaCy API into the code. The problems for me are in preprocess.py: I've updated the model load to nlp = spacy.load('en') and also converted the document attribute arrays to 64-bit integers instead of 32-bit, which were overflowing. But it is still producing negative values in the matrix, which fail the assertion. I can't tell if another hour will solve it, so I'm going to carry on improving my LDA, NMF and LSA topic models instead.
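That assertion failure is consistent with spaCy 2.x storing token attributes as unsigned 64-bit hash values, which turn negative once cast to a signed dtype. A minimal sketch of the effect (assuming spaCy 2.x with the en_core_web_sm model installed):

import numpy as np
import spacy
from spacy.attrs import LOWER

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Lexeme attributes in spaCy 2.x are unsigned 64-bit hash values")
arr = doc.to_array([LOWER])
print(arr.dtype)                   # uint64
print(arr.astype(np.int64).min())  # hashes above 2**63 show up as negative here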

@haebichan

Hey, thanks for responding and confirming it. nlp = spacy.load('en') shouldn't work since that shortcut is deprecated and has changed to nlp = spacy.load('en_core_web_sm'). But there are so many other problems that I'm not sure it's worth trying to fix everything.

@aleksandra-datacamp

aleksandra-datacamp commented Nov 9, 2018

If you use np.uint64 as dtype, it works. Preprocess becomes:

import numpy as np
import spacy
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    if nlp is None:
        nlp = spacy.load('en_core_web_md')
    data = np.zeros((len(texts), max_length), dtype='uint64')
    # -2 wraps around to a very large uint64 value, used here as the <SKIP> sentinel
    skip = np.uint64(skip)
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    # (Span.merge is the spaCy 1.x/2.x API; it was removed in spaCy 3)
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_URL, LIKE_EMAIL])
        if len(dat) > 0:
            msg = "Negative indices reserved for special tokens" 
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            # select the indices of tokens that are URLs or Emails
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            # keep the array unsigned so the uint64 skip sentinel can be assigned safely
            dat = dat.astype('uint64')
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
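
For reference, a minimal usage sketch (assuming spaCy 2.x with en_core_web_md installed; the sample texts are just placeholders):

texts = ["Topic models learn document-level structure.",
         "Contact me at someone@example.com about word vectors."]
tokens, vocab = tokenize(texts, max_length=12)
print(tokens.shape, tokens.dtype)   # (2, 12) uint64
print(sorted(vocab.values())[:5])   # lowercased strings plus the '<SKIP>' marker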

@ghost

ghost commented Nov 30, 2018

I can't even successfully execute "python setup.py install". A lot of errors occur in C++ code: #86

@GregSilverman

GregSilverman commented Dec 6, 2018

Here's a port to TensorFlow that allegedly works with Python 3: lda2vec-tf. There's also a port to PyTorch: lda2vec-pytorch. NB: the PyTorch README says:

"Warning: I, personally, believe that it is quite hard to make lda2vec algorithm work.
Sometimes it finds a couple of topics, sometimes not. Usually a lot of found topics are a total mess.
The algorithm is prone to poor local minima. It greatly depends on values of initial topic assignments."

Not very encouraging, which is kind of disappointing.

@MChrys

MChrys commented Jan 31, 2019

(quoting @GregSilverman's comment above about the lda2vec-tf and lda2vec-pytorch ports)

Hello Greg,
This is my first time importing a GitHub repository, and I was unable to import the original one (no module lda2vec). I would like to do it with the TensorFlow repo, but there is no documentation or example.
Could you share the code you used for your own test? It would be awesome!

@GregSilverman

GregSilverman commented Jan 31, 2019

I haven't actually done anything with it! I was hoping someone else had. ^_^

@MChrys

MChrys commented Jan 31, 2019

OK :) Thank you for your answer.

@nateraw

nateraw commented Feb 7, 2019

I also have my own TensorFlow implementation up, adapted from the one @MChrys linked to. Again, it works, but it is very finicky.

@khan04

khan04 commented Feb 10, 2019

Hello all,

I was struggling to set it up and to run some of the functions with Python 3.7. I got it installed, but I'm facing a lot of issues. I could visualize the 20 Newsgroups data since I have the generated file available, but I've had no luck yet creating the file in .npz format.

Question to Chris: just wondering if you have a working (most recent) version that we can try out? I'm also facing a lot of issues with the CuPy install. Can we run without GPU functionality?

Thank you!

Ahmed Khan
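
(For reference, Chainer itself can run on the CPU with plain NumPy when CuPy is unavailable; a rough sketch of the usual fallback pattern, where train_step and model are purely illustrative names and not part of the lda2vec codebase:)

try:
    import cupy as xp              # GPU arrays, requires CUDA
    gpu_available = True
except ImportError:
    import numpy as xp             # CPU-only fallback
    gpu_available = False

def train_step(model, batch):
    if gpu_available:
        model.to_gpu()             # Chainer moves the parameters onto the GPU
    data = xp.asarray(batch)       # dispatches to CuPy or NumPy transparently
    return model(data)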

@whcjimmy

try my fork: https://github.com/whcjimmy/lda2vec.

I've tested the twenty_newsgroups example.

@khan04

khan04 commented Feb 14, 2019 via email

@whcjimmy

whcjimmy commented Feb 15, 2019

My doc follows this repo and I didn't change any details, but I can try to answer your questions.

It doesn't mean "German" is -0.6. The whole 1 × 5 word vector is used to represent the word "German". The word vector may come from a pre-trained word2vec model such as GoogleNews-vectors-negative300.bin, but I am not sure.
The document vector comes from multiplying the document proportions by the topic matrix. So 0.41*(-1.9) + 0.26*0.96 + 0.34*(-0.7) = -0.7674, which is close to -0.7.
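
In code, that multiplication is just a dot product (a tiny sketch using the numbers quoted above, which are only the first component of each topic vector):

import numpy as np

doc_proportions = np.array([0.41, 0.26, 0.34])    # mixture weights over 3 topics
topic_component = np.array([-1.9, 0.96, -0.7])    # first component of each topic vector

print(doc_proportions @ topic_component)          # -0.7674, i.e. roughly -0.7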

I didn't get this error. However, in corpus.py you can see that the only key less than 0 is -2, which marks special tokens (line 140).
Maybe you can check why the key -1 is being generated.

Hope these answers help you!

@JennieGerhardt

When using the file 'preprocess.py', why does the vocab output look this bad?
12521213015474045184: u"max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'avpvt_%n2ijl8ymd9#oq",
6474950898978915842: u'160k',
13196128760786322950: u'liberty',

@lordtt13

lordtt13 commented Jul 1, 2020

(quoting @aleksandra-datacamp's np.uint64 tokenize snippet above)

Tried this out; it doesn't work.

@lordtt13

lordtt13 commented Jul 1, 2020

Basically I have tried everything to port it to Python 3, and I'm not even able to get the preprocess functions working. I saw this issue and tried everything here too. Going to use gensim LDA instead.
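
For anyone taking the same route, a minimal gensim LDA sketch (the sample texts are placeholders):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder corpus: each document is a list of tokens.
texts = [["topic", "models", "for", "documents"],
         ["word", "vectors", "and", "topic", "models"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])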

@duaaalkhafaje

(quoting @lordtt13's comment above about switching to gensim LDA)

Hello from 2021.
I wonder whether you have completed the work on LDA2Vec or not, because frankly I have worked on it a lot and I still face many problems.
