
LDA2Vec doesn't work at all; does anyone have the correct code for python 3? #84

haebichan opened this issue Oct 23, 2018 · 17 comments


@haebichan

LDA2Vec doesn't seem to work at all at its current stage. The Gensim code is outdated, the general code runs on Python 2.7, and people seem to be having problems with Chainer and other dependencies.

I tried to revise the code for Python 3, but I'm hitting walls here and there, especially since I don't know exactly how every function works. Has anyone solved these general issues? Has it actually worked for anyone recently?

@bosulliv

bosulliv commented Oct 31, 2018

It is quite broken, even on Python 2. I spun up a virtualenv and spent an hour trying to wrestle the latest spaCy API into the code. The problems for me are in preprocess.py: I've updated the model load to nlp = spacy.load('en') and also converted the document attribute arrays to 64-bit integers instead of 32-bit, which were overflowing. But it is still producing negative values in the matrix, which fail the assertion. I can't tell if another hour will solve it, so I'm going to carry on improving my LDA, NMF and LSA topic models instead.
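That assertion failure is consistent with spaCy 2.x storing token attributes as unsigned 64-bit hash values, which turn negative once cast to a signed dtype. A minimal sketch of the effect (assuming spaCy 2.x with the en_core_web_sm model installed):

import numpy as np
import spacy
from spacy.attrs import LOWER

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Lexeme attributes in spaCy 2.x are unsigned 64-bit hash values")
arr = doc.to_array([LOWER])
print(arr.dtype)                   # uint64
print(arr.astype(np.int64).min())  # hashes above 2**63 show up as negative here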

@haebichan

Hey, thanks for responding and confirming it. nlp = spacy.load('en') shouldn't work since that shortcut is deprecated and has changed to nlp = spacy.load('en_core_web_sm'). But there are so many other problems that I'm not sure it's worth trying to fix everything.

@aleksandra-datacamp

aleksandra-datacamp commented Nov 9, 2018

If you use np.uint64 as dtype, it works. Preprocess becomes:

import numpy as np
import spacy
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    if nlp is None:
        nlp = spacy.load('en_core_web_md')
    data = np.zeros((len(texts), max_length), dtype='uint64')
    # -2 wraps around to a very large uint64 value, used here as the <SKIP> sentinel
    skip = np.uint64(skip)
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    # (Span.merge is the spaCy 1.x/2.x API; it was removed in spaCy 3)
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_URL, LIKE_EMAIL])
        if len(dat) > 0:
            msg = "Negative indices reserved for special tokens" 
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            # select the indices of tokens that are URLs or Emails
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            # keep the array unsigned so the uint64 skip sentinel can be assigned safely
            dat = dat.astype('uint64')
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
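
For reference, a minimal usage sketch (assuming spaCy 2.x with en_core_web_md installed; the sample texts are just placeholders):

texts = ["Topic models learn document-level structure.",
         "Contact me at someone@example.com about word vectors."]
tokens, vocab = tokenize(texts, max_length=12)
print(tokens.shape, tokens.dtype)   # (2, 12) uint64
print(sorted(vocab.values())[:5])   # lowercased strings plus the '<SKIP>' marker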

@ghost

ghost commented Nov 30, 2018

I can't even successfully execute "python setup.py install". A lot of errors occur in C++ code: #86

@GregSilverman

GregSilverman commented Dec 6, 2018

Here's a port to TensorFlow that allegedly works with Python 3: lda2vec-tf. There's also a port to PyTorch: lda2vec-pytorch. NB: the PyTorch README says:

"Warning: I, personally, believe that it is quite hard to make lda2vec algorithm work.
Sometimes it finds a couple of topics, sometimes not. Usually a lot of found topics are a total mess.
The algorithm is prone to poor local minima. It greatly depends on values of initial topic assignments."

Not very encouraging, which is kind of disappointing.

@MChrys

MChrys commented Jan 31, 2019

(quoting @GregSilverman's comment above about the lda2vec-tf and lda2vec-pytorch ports)

Hello Greg,
This is my first time importing a GitHub repository, and I was unable to import the original one (no module lda2vec). I would like to do it with the TensorFlow repo, but there is no documentation or example.
Could you share the code you used for your own test? It would be awesome!

@GregSilverman

GregSilverman commented Jan 31, 2019

I haven't actually done anything with it! I was hoping someone else had. ^_^

@MChrys

MChrys commented Jan 31, 2019

OK :) Thank you for your answer.

@nateraw

nateraw commented Feb 7, 2019

I also have my own TensorFlow implementation up, adapted from the one @MChrys linked to. Again, it works, but it is very finicky.

@khan04

khan04 commented Feb 10, 2019

Hello all,

I was struggling to set it up and to run some of the functions with Python 3.7. I got it installed, but I'm facing a lot of issues. I could visualize the 20 Newsgroups data since I have the generated file available, but I've had no luck yet creating the file in .npz format.

Question to Chris: just wondering if you have a working (most recent) version that we can try out? I'm also facing a lot of issues with the CuPy install. Can we run without GPU functionality?

Thank you!

Ahmed Khan
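
(For reference, Chainer itself can run on the CPU with plain NumPy when CuPy is unavailable; a rough sketch of the usual fallback pattern, where train_step and model are purely illustrative names and not part of the lda2vec codebase:)

try:
    import cupy as xp              # GPU arrays, requires CUDA
    gpu_available = True
except ImportError:
    import numpy as xp             # CPU-only fallback
    gpu_available = False

def train_step(model, batch):
    if gpu_available:
        model.to_gpu()             # Chainer moves the parameters onto the GPU
    data = xp.asarray(batch)       # dispatches to CuPy or NumPy transparently
    return model(data)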

@whcjimmy

try my fork: https://github.com/whcjimmy/lda2vec.

I've tested the twenty_newsgroups example.

@khan04

khan04 commented Feb 14, 2019 via email

@whcjimmy

whcjimmy commented Feb 15, 2019

My doc follows this repo and I didn't change any details, but I can try to answer your questions.

It doesn't mean "German" is -0.6. The whole 1 × 5 word vector is used to represent the word "German". The word vector may come from a pre-trained word2vec model such as GoogleNews-vectors-negative300.bin, but I am not sure.
The document vector comes from multiplying the document proportions by the topic matrix. So 0.41*(-1.9) + 0.26*0.96 + 0.34*(-0.7) = -0.7674, which is close to -0.7.
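
In code, that multiplication is just a dot product (a tiny sketch using the numbers quoted above, which are only the first component of each topic vector):

import numpy as np

doc_proportions = np.array([0.41, 0.26, 0.34])    # mixture weights over 3 topics
topic_component = np.array([-1.9, 0.96, -0.7])    # first component of each topic vector

print(doc_proportions @ topic_component)          # -0.7674, i.e. roughly -0.7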

I didn't get this error. However, in corpus.py you can see that the only key less than 0 is -2, which marks special tokens (line 140).
Maybe you can check why the key -1 is being generated.

Hope these answers help you!

@JennieGerhardt

When using the file 'preprocess.py', why does the vocab output look this bad?
12521213015474045184: u"max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'avpvt_%n2ijl8ymd9#oq",
6474950898978915842: u'160k',
13196128760786322950: u'liberty',

@lordtt13

lordtt13 commented Jul 1, 2020

(quoting @aleksandra-datacamp's np.uint64 tokenize snippet above)

Tried this out; it doesn't work.

@lordtt13

lordtt13 commented Jul 1, 2020

Basically I have tried everything to port it to Python 3, and I'm not even able to get the preprocess functions working. I saw this issue and tried everything here too. Going to use gensim LDA instead.
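
For anyone taking the same route, a minimal gensim LDA sketch (the sample texts are placeholders):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder corpus: each document is a list of tokens.
texts = [["topic", "models", "for", "documents"],
         ["word", "vectors", "and", "topic", "models"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])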

@duaaalkhafaje

(quoting @lordtt13's comment above about switching to gensim LDA)

Hello from 2021.
I wonder whether you have completed the work on LDA2Vec or not, because frankly I have worked on it a lot and I still face many problems.
