v0.9.0
0.9.0 — 2019-06-15
Added
- Added function `dt.models.kfold(nrows, nsplits)` to prepare indices for
  k-fold splitting. This function returns `nsplits` pairs of row selectors
  such that when these selectors are applied to an `nrows`-row frame, that
  frame is split into train and test parts according to the K-fold
  splitting scheme.
- Added function `dt.models.kfold_random(nrows, nsplits, seed)`, which is
  similar to `kfold(nrows, nsplits)`, except that the assignment of rows into
  folds is randomized, not deterministic.
- `Frame.rbind()` can now also accept a list or tuple of frames (previously
  only a vararg sequence was allowed).
- Method `.len()` can be applied to a string column to obtain the lengths
  of strings in each row.
- Method `.re_match(re)` applies to a string column and produces a boolean
  indicator of whether each value matches the regular expression `re`.
  The method matches the entire string, not just the beginning. Thus, it
  most closely resembles the Python function `re.fullmatch()`.
- Added early stopping support to the FTRL algorithm, which can now perform
  binomial and multinomial classification for categorical targets, as well as
  regression for continuous targets.
- New function `dt.median()` can be used to compute the median of a certain
  column or expression, either per group or for the entire Frame (#1530).
- `Frame.__str__()` now returns a string containing the preview of the
  frame's data. This allows datatable frames to be used with `print()`.
- Added method `dt.options.describe()`, which prints the available
  options together with their values and descriptions.
- Added `dt.options.context(option=value)`, which can be used in a
  with-statement to temporarily change the value of one or more options, and
  then restore their original values at the end of the with-block.
- Added options `fread.log.escape_unicode` (controls treatment of unicode
  characters in fread's verbose log) and `display.use_colors` (turns colored
  output in the console on or off).
- `dt.options` now helps the user when they make a typo: if an option
  with a certain name does not exist, the error message will suggest the
  correct spelling.
- Most long-running operations in `datatable` will now show a progress bar.
  Its behavior can be controlled via the `dt.options.progress` set of options.
- Added internal function `dt.internal.compiler_version()`.
- New `datatable.math` module is a library of various mathematical functions
  that can be applied to datatable Frames. The set of functions is close to
  what is available in the standard Python `math` module. See documentation
  for more details.
- New module `datatable.sphinxext.dtframe_directive`, which can be used as
  a plugin for Sphinx. This module adds the directive `.. dtframe`, which
  makes it easy to include a Frame display in an .rst document.
- A Frame can now be treated as an iterable over the columns. Thus, a Frame
  object can now be used in a for-loop, producing its individual columns.
- A Frame can now be treated as a mapping; in particular both `dict(frame)`
  and `**frame` are now valid.
- Single-column frames can be used as sources for Frame construction.
- CSV writer now quotes fields containing a single-quote mark (`'`).
- Added parameter `quoting=` to method `Frame.to_csv()`. The accepted values
  are 4 constants from the standard `csv` module: `csv.QUOTE_MINIMAL`
  (default), `csv.QUOTE_ALL`, `csv.QUOTE_NONNUMERIC` and `csv.QUOTE_NONE`.
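
The four `quoting=` constants come from Python's standard `csv` module, so their meaning can be previewed without datatable. The sketch below uses only the standard library's `csv.writer`; the sample rows and the `render` helper are made up for illustration, and the exact output of `Frame.to_csv(quoting=...)` may differ in details (for instance, the CSV writer entry above says datatable also quotes fields containing a single-quote mark, which the standard writer does not).

```python
import csv
import io

# Hypothetical sample rows, used only for illustration.
rows = [["id", "name", "score"], [1, "Smith, J", 2.5]]


def render(quoting):
    """Write `rows` with the standard csv.writer and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        quoting=quoting,
        escapechar="\\",      # needed by QUOTE_NONE to escape the embedded comma
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


# Print the same rows under each of the four quoting modes.
for name in ("QUOTE_MINIMAL", "QUOTE_ALL", "QUOTE_NONNUMERIC", "QUOTE_NONE"):
    print(name)
    print(render(getattr(csv, name)))
```

In the standard module, `QUOTE_MINIMAL` quotes only fields that need it (here, the one containing a comma), `QUOTE_ALL` quotes every field, `QUOTE_NONNUMERIC` quotes every non-numeric field, and `QUOTE_NONE` never quotes and relies on an escape character instead.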
Fixed
- Fixed crash in certain circumstances when a key was applied after a
  groupby (#1639).
- `Frame.to_numpy()` now returns a numpy `masked_array` if the frame has
  any NA values (#1619).
- A keyed frame will now be rendered correctly when viewing it in python
  console via `Frame.view()` (#1672).
- Str32 column can no longer overflow during the `.replace()` operation,
  or when converting from python, numpy or pandas, etc. In all these cases
  we will now transparently create a Str64 column instead (#1694).
- The reported frame size (`sys.getsizeof(DT)`) is now more accurate; in
  particular the content of string columns is no longer ignored (#1697).
- Type casting into str32 no longer produces an error if the resulting column
  is larger than 2GB. Now a str64 column will be returned instead (#1695).
- Fixed memory leak during computation of a generic `DT[i, j]` expression.
  Another memory leak was during generation of string columns, now also fixed
  (#1705).
- Fixed crash upon exiting from a python terminal, if the user ever called
  function `frame_column_rowindex().type` (#1703).
- Pandas "boolean column with NAs" (of dtype `object`) now converts into
  datatable `bool8` column when pandas DataFrame is converted into a datatable
  Frame (#1730).
- Fixed conversion to numpy of a view Frame which contains NAs (#1738).
- `datatable` can now be safely used with `multiprocessing`, or other modules
  that perform fork-without-exec (#1758). The child process will spawn its
  own thread pool that will have the same number of threads as the parent.
  Adjust `dt.options.nthreads` in the child process(es) if a different number
  of threads is required.
- The interactive mode is no longer improperly turned on in IPython (#1789).
- Fixed issue with mis-aligned frame headers in IPython, caused by IPython
  inserting `Out[X]:` in front of the rendered Frame display (#1793).
- Improved rendering of Frames in terminals with white background: we no longer
  use 'bright_white' color for emphasis, only 'bold' (#1793).
- Fixed crash when a new column was created via partial assignment, i.e.
  `DT[i, "new_col"] = expr` (#1800).
- Fixed memory leaks/crashes when materializing an object column (#1805).
- Fixed creating a Frame from a pandas DataFrame that has duplicate column
  names (#1816).
- Fixed a UnicodeDecodeError that could be thrown when viewing a Frame with
  unicode characters in Jupyter notebook. The error only manifested for
  strings that were longer than 50 bytes in length (#1825).
- Fixed crash when `Frame.colindex()` was used without any arguments; this
  now raises an exception instead (#1834).
- Fixed possible crash when writing to a disk that doesn't have enough free
  space on it (#1837).
- Fixed invalid Frame being created when reading a large string column (str64)
  with fread, and the column contains NA values.
- Fixed FTRL model not resuming properly after unpickling (#1846).
- Fixed crash that occurred when sorting by multiple columns, and the first
  column is of low cardinality (#1857).
- Fixed display of NA values produced during a join, when a Frame was displayed
  in Jupyter Lab (#1872).
- Fixed a crash when replacing values in a str64 column (#1890).
- `cbind()` no longer throws an error when passed a generator producing
  temporary frames (#1905).
- Fixed comparison of string columns vs. value `None` (#1912).
- Fixed a crash when trying to select individual cells from a joined Frame,
  for the cells that were un-matched during the join (#1917).
- Fixed a crash when writing a joined frame into CSV (#1919).
- Fixed a crash when writing into CSV string view columns, especially of
  str64 type (#1921).
Changed
- A Frame will no longer be shown in "interactive" mode in console by default.
  The previous behavior can be restored with
  `dt.options.display.interactive = True`. Alternatively, you can explore a
  Frame interactively using `frame.view(True)`.
- Improved performance of type-casting a view column: now the code avoids
  materializing the column before performing the cast.
- `Frame` class is now defined fully in C++, improving code robustness and
  performance. The property `Frame.internal` was removed, as it no longer
  represents anything. Certain internal properties of `Frame` can be accessed
  via functions declared in the `dt.internal` module.
- `datatable` no longer uses OpenMP for parallelism. Instead, we use our own
  thread pool to perform multi-threaded computations (#1736).
- Parameter `progress_fn` in function `dt.models.aggregate()` is removed.
  In its place you can set the global option `dt.options.progress.callback`.
- Removed deprecated Frame methods `.topython()`, `.topandas()`, `.tonumpy()`,
  and `Frame.__call__()`.
- Syntax `DT[col]` has been restored (was previously deprecated in 0.7.0),
  however it works only when `col` is an integer or a string. Support for
  slices may be added in the future, or not: there is a potential to confuse
  `DT[a:b]` for a row selection. A column slice may still be selected via
  the i-j selector `DT[:, a:b]`.
- The `nthreads=` parameter in `Frame.to_csv()` was removed. If needed, please
  set the global option `dt.options.nthreads`.
Deprecated
- Frame method `.scalar()` is now deprecated and will be removed in release
  0.10.0. Please use `frame[0, 0]` instead.
- Frame method `.append()` is now deprecated and will be removed in release
  0.10.0. Please use `.rbind()` instead.
- Frame method `.save()` was renamed into `.to_jay()` (for consistency with
  other `.to_*()` methods). The old name is still usable, but marked as
  deprecated and will be removed in 0.10.0.
Notes
- Thanks to everyone who helped make `datatable` more stable by discovering
  and reporting bugs that were fixed in this release:
  - [Arno Candel][] (#1619, #1730, #1738, #1800, #1803, #1846, #1857, #1890,
    #1891, #1919, #1921),
  - [Antorsae][] (#1639),
  - [Olivier][] (#1872),
  - [Hawk Berry][] (#1834),
  - [Mateusz Dymczyk][] (#1912),
  - [Pasha Stetsenko][] (#1672, #1694, #1695, #1697, #1703, #1705, #1905,
    #1917),
  - [Tom Kraljevic][] (#1805),
  - [XiaomoWu][] (#1825)
-