-
My apologies up front if this is too vague, but I am trying to understand when setting a key on a datatable Frame is slow. Specifically, I have a three-column Frame with roughly 400K rows. However, when I try to set a key on two of the columns, it churns away for more than 60 seconds. Does anyone have suggestions as to what conditions make this operation particularly slow? On the same machine, the same operation with R data.table takes no time at all.
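For reference, "setting a key" here means assigning column names to `Frame.key`. A small illustration (the frame and column names below are hypothetical, not the poster's actual data):

```python
import datatable as dt

# A toy three-column frame; at 400K rows this same assignment is what stalls.
df = dt.Frame(a=["x", "y", "z"], b=[1, 2, 3], c=[0.1, 0.2, 0.3])
df.key = ["a", "b"]  # sets a two-column key; requires unique (a, b) pairs
```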
-
To set a key on a set of columns, datatable invokes a group-by operation on those columns: https://github.com/h2oai/datatable/blob/main/src/core/frame/key.cc#L118

Grouping two columns having 400K rows is a pretty expensive operation, because at the end it requires sorting of one string and one integer column. From what I see in the R data.table documentation, `setkey` likewise sorts the table by the key columns.

Btw, can you share your data and platform information? My feeling is that sorting two columns should not take 60s anyway. Also, what are the units for the R data.table timing? If that's in seconds, I guess Python datatable should demonstrate similar performance; at least this is what I observe locally when I group a randomly generated 400K x 2 frame.
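As a rough local check of that claim, one can time key-setting on randomly generated data of the same shape (a sketch; the column names, string length, and value range are assumptions, not from the thread):

```python
import random
import string
import time

import datatable as dt

# A hypothetical 400K x 2 frame: one string column and one integer column,
# mirroring the shape described above.
n = 400_000
DT = dt.Frame(
    a=["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(n)],
    b=[random.randrange(1_000_000) for _ in range(n)],
)

# Setting the key triggers the group-by/sort discussed above. With random
# data the (a, b) pairs are almost surely unique, which a key requires.
t0 = time.time()
DT.key = ["a", "b"]
print(f"key set in {time.time() - t0:.2f}s")
```

On an ordinary workstation this completes in well under a second, which is why a 60s timing points away from the algorithm itself.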
-
Thanks much for your time @oleksiyskononenko, I've figured out that it is most definitely a platform issue. On a similar machine, I'm able to set the key (good question, but no, it is not already sorted) using datatable in a time frame that is similar to what I'm seeing with R.

There are some important differences between the machines. I would try to install the development version from github on the slower-performing machine, but I can't get it to compile. Perhaps the gcc version is too old? 4.8.5?
-
@oleksiyskononenko it's interesting, because on the same platform, pandas is not having a problem at all with sorting:

With pandas - read and sort:

```python
%%time
import pandas as pd
df = pd.read_csv("df.csv")
df.sort_values(by=['a','b'])
```

```
CPU times: user 180 ms, sys: 14.4 ms, total: 194 ms
Wall time: 194 ms
```

With datatable - read and sort:

```python
%%time
import datatable as dt
df = dt.fread("df.csv")
df.sort(['a','b'])
```

```
CPU times: user 5min 48s, sys: 121 ms, total: 5min 48s
Wall time: 21.9 s
```
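Note that 5min 48s of CPU time against 21.9s of wall time implies roughly 16 threads were kept busy, so the size of datatable's thread pool is worth checking (a diagnostic sketch, assuming Linux):

```python
import os
import datatable as dt

# Threads datatable decided to use at startup.
print(dt.options.nthreads)

# CPUs this process may actually run on (Linux-only); on a cluster the
# affinity mask reflects the job's allocation, not the whole node.
print(len(os.sched_getaffinity(0)))
```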
-
@oleksiyskononenko, I set the number of threads manually, and now:

```python
%%timeit
df.sort(['a','b'])
```

```
69.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

However, there is more to the story. The problem only seems to be occurring within an interactive Jupyter notebook session on a compute cluster. If I ssh into the cluster and run the same code in a Python terminal, the slowdown does not occur.
-
Solution: Make sure that the number of threads detected by datatable upon initialization does not exceed the number actually available. In some situations, the number of cpus on the system (extracted from `/proc/cpuinfo`, for example) will be more than are actually available to the job (say on a compute cluster where the number of cpus provided to the job is less than that on the node). In such cases, extract the actual number available and set it using `datatable.options.nthreads = x`, where `x` is equal to or less than the actual number available.
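A sketch of that workaround, assuming Linux (where `os.sched_getaffinity` reports the CPUs actually granted to the process, e.g. by the cluster's cgroup/affinity limits):

```python
import os
import datatable as dt

# CPUs this process is allowed to run on; on a compute cluster this can be
# far fewer than what /proc/cpuinfo reports for the whole node.
available = len(os.sched_getaffinity(0))

# Cap datatable's thread pool before doing any heavy work.
dt.options.nthreads = available
print(dt.options.nthreads)
```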