Parallelize TextCategorizer training #3828
Replies: 2 comments
-
As stated here on Stack Overflow, spaCy wasn't built to run on multiple CPUs but to be efficient on one. You can run some spaCy tasks in parallel, as commented in this issue, but training doesn't seem to be included. Could you quantify your "awfully long long time" and describe the dataset you're using? I am working with the TextCategorizer as well and don't really face long training times on my not-state-of-the-art CPU.
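For the parallelizable part, here's a minimal sketch of multi-process inference, assuming spaCy v2.2.2 or later (where `nlp.pipe` accepts an `n_process` argument); the model name and texts are placeholders:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model
texts = ["First document ...", "Second document ..."] * 1000

# Inference can fan out across worker processes; training cannot.
for doc in nlp.pipe(texts, n_process=4, batch_size=100):
    print(doc.cats)  # TextCategorizer scores, if the pipeline has one
```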
-
In my case, with 120K lines in the dataset, training takes about 4 hours (I'm talking about NER here, so maybe a different issue?). As I'm still at the beginning, I'd like to iterate faster (my CPU sits at around 12% during training, so it can do more work) and experiment with parameters, different data layouts, etc. Is there any way, even a hacky one, to make training faster? Maybe the training set is too large? But then I see that if I make it smaller, it doesn't learn that well :) Unless I go crazy with the learning rate, which seems to backfire, too.
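One hack that usually helps iteration speed is to tune on a random subset first and only run on the full 120K at the end. A sketch using the spaCy v2 training loop; `TRAIN_DATA` and `load_my_annotations` are hypothetical, standing in for your list of `(text, {"entities": [...]})` pairs:

```python
import random

import spacy
from spacy.util import minibatch, compounding

# Hypothetical data: [(text, {"entities": [(start, end, label), ...]}), ...]
TRAIN_DATA = load_my_annotations()  # placeholder loader

random.shuffle(TRAIN_DATA)
subset = TRAIN_DATA[:10000]  # iterate on ~10K examples before scaling up

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in subset:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(subset)
    losses = {}
    # Compounding batch sizes: small batches early, larger ones later.
    for batch in minibatch(subset, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
    print(epoch, losses)
```

Hyperparameter choices found on the subset tend to carry over reasonably well, so the 4-hour full-set runs can wait until the end.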
-
Is there a way to do this? It's using only 1 CPU core (by design, it seems), but it's taking an awfully long long time; it's a shame to have 15 other threads sitting idle.
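One thing worth checking, though this is speculative and not a spaCy feature per se: the BLAS library backing numpy may parallelize the matrix multiplications inside training if its thread count is raised before numpy is first imported. Gains vary and may be zero:

```python
# Must run before numpy/spacy are imported anywhere in the process.
import os
os.environ.setdefault("OPENBLAS_NUM_THREADS", "8")
os.environ.setdefault("MKL_NUM_THREADS", "8")

import spacy  # noqa: E402  (deliberately imported after setting the env vars)
```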