Merge branch 'release-1.5'

DistrictDataLabs · Aug 21, 2022 · 223a252
2 parents cbac5e3 + 91cf014
commit 223a252
Showing 177 changed files with 4,405 additions and 559 deletions.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -70,4 +70,4 @@ Here's a handy checklist to go through before submitting a PR, note that you can
 
 <!-- If you've added to the docs -->
 
-- [ ] _Have you built the docs using `make html`?_
+- [ ] _Have you built the docs using `make html` (must be run from `docs/`)?_
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -127,4 +127,4 @@ jobs:
       - name: Run Sphinx
         uses: ammaraskar/sphinx-action@master
         with:
-          docs-folder: "docs/"
+          docs-folder: "docs/"
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -0,0 +1,34 @@
+# This workflow installs pre-commit and runs the configured lint checks on the files changed in a pull request
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+name: Yellowbrick PR Linting
+
+on:
+  # Trigger on pull request always (note the trailing colon)
+  pull_request:
+
+jobs:
+  # Run pre-commit checks on the files changed
+  linting:
+    runs-on: ubuntu-latest
+    name: Linting
+    steps:
+      - name: Checkout Code
+        uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: 3.9
+
+      - name: Install Dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install pre-commit
+          pre-commit install
+
+      - name: Run Checks
+        run: |
+          pre-commit run --from-ref origin/${{ github.base_ref }} --to-ref HEAD --show-diff-on-failure
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,26 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+repos:
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v3.2.0
+    hooks:
+    -   id: trailing-whitespace
+    -   id: end-of-file-fixer
+    -   id: check-yaml
+    -   id: check-added-large-files
+    -   id: check-json
+    -   id: check-merge-conflict
+-   repo: https://github.com/psf/black
+    rev: 22.6.0
+    hooks:
+    -   id: black
+-   repo: https://github.com/PyCQA/flake8
+    rev: 5.0.4
+    hooks:
+    -   id: flake8
+-   repo: https://github.com/pre-commit/pygrep-hooks
+    rev: v1.9.0
+    hooks:
+    -   id: rst-backticks
+    -   id: rst-directive-colons
+    -   id: rst-inline-touching-normal
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -94,7 +94,18 @@ Once forked, use the following steps to get your development environment set up
     $ pip install -r docs/requirements.txt
     ```
 
-4. Switch to the develop branch.
+4. (Optional) Set up pre-commit hooks.
+
+    When opening a PR in the Yellowbrick repository, a series of checks will be run on your contribution, some of which lint and look at the formatting of your code. These may indicate some changes that need to be made before your contribution can be reviewed. You can set up pre-commit hooks to run these checks locally upon running `git commit` to ensure your contribution will pass formatting and linting checks. To set this up, you will need to uncomment the pre-commit line in `requirements.txt` and then run the following commands:
+
+    ```
+    $ pip install -r requirements.txt
+    $ pre-commit install
+    ```
+
+    The next time you run `git commit` in the Yellowbrick repository, the checks will automatically run.
+
+5. Switch to the develop branch.
 
     The Yellowbrick repository has a `develop` branch that is the primary working branch for contributions. It is probably already the branch you're on, but you can make sure and switch to it as follows::
 

diff --git a/MAINTAINERS.md b/MAINTAINERS.md
@@ -13,17 +13,18 @@ For everyone who has [contributed](https://github.com/DistrictDataLabs/yellowbri
 This is a list of the primary project maintainers. Feel free to @ message them in issues and converse with them directly.
 
 - [bbengfort](https://github.com/bbengfort)
-- [ndanielsen](https://github.com/ndanielsen)
+- [rebeccabilbro](https://github.com/rebeccabilbro)
 - [lwgray](https://github.com/lwgray)
-- [NealHumphrey](https://github.com/NealHumphrey)
-- [jkeung](https://github.com/jkeung)
 - [pdamodaran](https://github.com/pdamodaran)
 
 ## Core Contributors
 
 This is a list of the core-contributors of the project. Core contributors set the road map and vision of the project. Keep an eye out for them in issues and check out their work to use as inspiration! Most likely they would also be happy to chat and answer questions.
 
-- [rebeccabilbro](https://github.com/rebeccabilbro)
+- [pdeziel](https://github.com/pdeziel)
+- [ndanielsen](https://github.com/ndanielsen)
+- [NealHumphrey](https://github.com/NealHumphrey)
+- [jkeung](https://github.com/jkeung)
 - [mattandahalfew](https://github.com/mattandahalfew)
 - [tuulihill](https://github.com/tuulihill)
 - [balavenkatesan](https://github.com/balavenkatesan)

diff --git a/README.md b/README.md
@@ -7,6 +7,7 @@
 [![Language Grade: Python](https://img.shields.io/lgtm/grade/python/g/DistrictDataLabs/yellowbrick.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/DistrictDataLabs/yellowbrick/context:python)
 [![PyPI version](https://badge.fury.io/py/yellowbrick.svg)](https://badge.fury.io/py/yellowbrick)
 [![Documentation Status](https://readthedocs.org/projects/yellowbrick/badge/?version=latest)](http://yellowbrick.readthedocs.io/en/latest/?badge=latest)
+[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1206239.svg)](https://doi.org/10.5281/zenodo.1206239)
 [![JOSS](http://joss.theoj.org/papers/10.21105/joss.01075/status.svg)](https://doi.org/10.21105/joss.01075)
 [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/DistrictDataLabs/yellowbrick/develop?filepath=examples%2Fexamples.ipynb)

diff --git a/docs/api/features/rankd.rst b/docs/api/features/rankd.rst
@@ -29,7 +29,7 @@ A one-dimensional ranking of features utilizes a ranking algorithm that takes in
     # Load the credit dataset
     X, y = load_credit()
 
-    # Instantiate the 1D visualizer with the Sharpiro ranking algorithm
+    # Instantiate the 1D visualizer with the Shapiro ranking algorithm
     visualizer = Rank1D(algorithm='shapiro')
 
     visualizer.fit(X, y)           # Fit the data to the visualizer

diff --git a/docs/api/model_selection/dropping_curve.rst b/docs/api/model_selection/dropping_curve.rst
@@ -0,0 +1,79 @@
+.. -*- mode: rst -*-
+
+Feature Dropping Curve
+=============================
+
+ =================   =====================
+ Visualizer           :class:`~yellowbrick.model_selection.dropping_curve.DroppingCurve`
+ Quick Method         :func:`~yellowbrick.model_selection.dropping_curve.dropping_curve`
+ Models               Classification, Regression, Clustering
+ Workflow             Model Selection
+ =================   =====================
+
+A feature dropping curve (FDC) shows the relationship between the score and the number of features used.
+This visualizer randomly drops input features, showing how the estimator benefits from additional features of the same type.
+For example, how many air quality sensors are needed across a city to accurately predict city-wide pollution levels?
+
+Feature dropping curves helpfully complement :doc:`rfecv` (RFECV).
+In the air quality sensor example, RFECV finds which sensors to keep in the specific city.
+Feature dropping curves estimate how many sensors a similar-sized city might need to track pollution levels.
+
+Feature dropping curves are common in the field of neural decoding, where they are called `neuron dropping curves <https://dx.doi.org/10.3389%2Ffnsys.2014.00102>`_ (`example <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8293867/figure/F3/>`_, panels C and H).
+Neural decoding research often quantifies how performance scales with neuron (or electrode) count.
+Because neurons do not correspond directly between participants, we use random neuron subsets to simulate what performance to expect when recording from other participants.
+
+To show how this works in practice, consider an image classification example using `handwritten digits <https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits>`_.
+
+.. plot::
+    :context: close-figs
+    :alt: Dropping Curve on the digits dataset
+
+    from sklearn.svm import SVC
+    from sklearn.datasets import load_digits
+
+    from yellowbrick.model_selection import DroppingCurve
+
+    # Load dataset
+    X, y = load_digits(return_X_y=True)
+
+    # Initialize visualizer with estimator
+    visualizer = DroppingCurve(SVC())
+
+    # Fit the data to the visualizer
+    visualizer.fit(X, y)
+    # Finalize and render the figure
+    visualizer.show()
+
+This figure shows an input feature dropping curve.
+Since the features are informative, the accuracy increases with larger feature subsets.
+The shaded area represents the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.
+
+The visualization can be interpreted as the performance if we knew some image pixels were corrupted.
+As an alternative interpretation, the dropping curve roughly estimates the accuracy if the image resolution were reduced by downsampling.
+
+Quick Method
+------------
+The same functionality can be achieved with the associated quick method ``dropping_curve``. This method will build the ``DroppingCurve`` with the associated arguments, fit it, then (optionally) immediately show the visualization.
+
+.. plot::
+    :context: close-figs
+    :alt: Dropping Curve Quick Method on the digits dataset
+
+    from sklearn.svm import SVC
+    from sklearn.datasets import load_digits
+
+    from yellowbrick.model_selection import dropping_curve
+
+    # Load dataset
+    X, y = load_digits(return_X_y=True)
+
+    dropping_curve(SVC(), X, y)
+
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.model_selection.dropping_curve
+    :members: DroppingCurve, dropping_curve
+    :undoc-members:
+    :show-inheritance:
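The random feature dropping described in `dropping_curve.rst` can be illustrated with a minimal standard-library sketch. This is not Yellowbrick's `DroppingCurve` implementation: the function names `nearest_centroid_accuracy` and `feature_dropping_sketch`, the toy nearest-centroid classifier, the synthetic data, and the single holdout split (in place of the cross-validation the docs describe) are all assumptions made for illustration.

```python
import random
import statistics


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, cols):
    """Holdout accuracy of a toy nearest-centroid classifier on the chosen columns."""
    sums, counts = {}, {}
    for row, label in zip(X_train, y_train):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += row[c]
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [s / counts[label] for s in acc] for label, acc in sums.items()
    }

    def predict(row):
        # Pick the class whose centroid is closest in squared Euclidean distance
        return min(
            centroids,
            key=lambda k: sum((row[c] - m) ** 2 for c, m in zip(cols, centroids[k])),
        )

    hits = sum(predict(row) == label for row, label in zip(X_test, y_test))
    return hits / len(y_test)


def feature_dropping_sketch(X, y, sizes, n_repeats=10, seed=0):
    """Return (subset size, mean accuracy, std) for random feature subsets."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    cut = (2 * len(order)) // 3  # hypothetical 2/3 train, 1/3 test split
    train, test = order[:cut], order[cut:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    X_te, y_te = [X[i] for i in test], [y[i] for i in test]

    curve = []
    for k in sizes:
        scores = []
        for _ in range(n_repeats):
            cols = rng.sample(range(len(X[0])), k)  # keep k random features
            scores.append(nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, cols))
        curve.append((k, statistics.mean(scores), statistics.pstdev(scores)))
    return curve


# Synthetic two-class data: every feature is weakly informative
data_rng = random.Random(42)
y = [i % 2 for i in range(120)]
X = [[data_rng.gauss(0.5 * label, 1.0) for _ in range(8)] for label in y]

for k, mean, std in feature_dropping_sketch(X, y, sizes=[1, 2, 4, 8]):
    print(f"{k} features: accuracy {mean:.2f} +/- {std:.2f}")
```

Each tuple corresponds to one point on the curve, with the standard deviation playing the role of the shaded band; with the real visualizer the variability comes from cross-validation folds rather than from repeated random subsets on one split.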
diff --git a/docs/api/model_selection/index.rst b/docs/api/model_selection/index.rst
@@ -14,6 +14,7 @@ The currently implemented model selection visualizers are as follows:
 -  :doc:`cross_validation`: displays cross-validated scores as a bar chart with average as a horizontal line.
 -  :doc:`importances`: rank features by relative importance in a model
 -  :doc:`rfecv`: select a subset of features by importance
+-  :doc:`dropping_curve`: select subsets of features randomly
 
 Model selection makes heavy use of cross validation to measure the performance of an estimator. Cross validation splits a dataset into a training data set and a test data set; the model is fit on the training data and evaluated on the test data. This helps avoid a common pitfall, overfitting, where the model simply memorizes the training data and does not generalize well to new or unknown input.
 
@@ -27,3 +28,4 @@ There are many ways to define how to split a dataset for cross validation. For m
    cross_validation
    importances
    rfecv
+   dropping_curve
diff --git a/docs/api/model_selection/validation_curve.rst b/docs/api/model_selection/validation_curve.rst
@@ -40,6 +40,8 @@ In our first example, we'll explore using the ``ValidationCurve`` visualizer wit
     viz.fit(X, y)
     viz.show()
 
+To further customize this plot, the visualizer also supports a ``markers`` parameter that changes the marker style.
+
 After loading and wrangling the data, we initialize the ``ValidationCurve`` with a ``DecisionTreeRegressor``. Decision trees become more overfit the deeper they are because at each level of the tree the partitions are dealing with a smaller subset of data. One way to deal with this overfitting process is to limit the depth of the tree. The validation curve explores the relationship of the ``"max_depth"`` parameter to the R2 score with 10 shuffle split cross-validation. The ``param_range`` argument specifies the values of ``max_depth``, here from 1 to 10 inclusive.
 
 We can see in the resulting visualization that a depth limit of less than 5 levels severely underfits the model on this data set because the training score and testing score climb together in this parameter range, and because of the high variability of cross validation on the test scores. After a depth of 7, the training and test scores diverge; this is because deeper trees are beginning to overfit the training data, providing no generalizability to the model. However, because the cross validation score does not necessarily decrease, the model is not suffering from high error due to variance.

diff --git a/docs/api/text/correlation.rst b/docs/api/text/correlation.rst
@@ -0,0 +1,63 @@
+.. -*- mode: rst -*-
+
+Word Correlation Plot
+=====================
+
+Word correlation illustrates the extent to which words or phrases co-appear across the documents in a corpus. This can be useful for understanding the relationships between known text features in a corpus with many documents. ``WordCorrelationPlot`` allows for the visualization of the document occurrence correlations between select words in a corpus. For a number of features n, the plot renders an n x n heatmap containing correlation values.
+
+The correlation values are computed using the `phi coefficient <https://en.wikipedia.org/wiki/Phi_coefficient>`_ metric, which is a measure of the association between two binary variables. A value close to 1 or -1 indicates that the occurrences of the two features are highly positively or negatively correlated, while a value close to 0 indicates no relationship between the two features.
+
+=================   ==============================
+Visualizer           :class:`~yellowbrick.text.correlation.WordCorrelationPlot`
+Quick Method         :func:`~yellowbrick.text.correlation.word_correlation()`
+Models               Text Modeling
+Workflow             Feature Engineering
+=================   ==============================
+
+.. plot::
+    :context: close-figs
+    :alt: Word Correlation Plot
+
+    from yellowbrick.datasets import load_hobbies
+    from yellowbrick.text.correlation import WordCorrelationPlot
+
+    # Load the text corpus
+    corpus = load_hobbies()
+
+    # Create the list of words to plot
+    words = ["Tatsumi Kimishima", "Nintendo", "game", "play", "man", "woman"]
+
+    # Instantiate the visualizer and draw the plot
+    viz = WordCorrelationPlot(words)
+    viz.fit(corpus.data)
+    viz.show()
+
+
+Quick Method
+------------
+
+The same functionality can be achieved with the associated quick method ``word_correlation``. This method will build the ``WordCorrelationPlot`` object with the associated arguments, fit it, then (optionally) immediately show the visualization.
+
+.. plot::
+    :context: close-figs
+    :alt: Word Correlation Plot
+
+    from yellowbrick.datasets import load_hobbies
+    from yellowbrick.text.correlation import word_correlation
+
+    # Load the text corpus
+    corpus = load_hobbies()
+
+    # Create the list of words to plot
+    words = ["Game", "player", "score", "oil"]
+
+    # Draw the plot
+    word_correlation(words, corpus.data)
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.text.correlation
+    :members: WordCorrelationPlot, word_correlation
+    :undoc-members:
+    :show-inheritance:
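For reference, the phi coefficient that `correlation.rst` cites can be worked through by hand on document-occurrence indicators. The sketch below is a plain-Python illustration of the standard 2x2 contingency formula, not the code in `yellowbrick.text.correlation`; the helper name `phi_coefficient`, the example indicator vectors, and the choice to return 0.0 when the coefficient is undefined are assumptions.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length sequences of 0/1 indicators."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Cells of the 2x2 contingency table
    n11 = sum(1 for a, b in zip(x, y) if a and b)          # both words present
    n10 = sum(1 for a, b in zip(x, y) if a and not b)      # only the first word
    n01 = sum(1 for a, b in zip(x, y) if not a and b)      # only the second word
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)  # neither word

    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Assumption: treat the undefined case (a word in all or no documents) as 0
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical occurrence of two words across six documents (1 = word appears)
game = [1, 1, 0, 0, 1, 0]
play = [1, 1, 0, 0, 0, 0]

print(round(phi_coefficient(game, play), 4))  # 6 / sqrt(72) -> 0.7071
```

Here both words appear in two documents, only the first word in one, and neither in three, giving (2*3 - 1*0) / sqrt(3*3*2*4), roughly 0.71, which is a fairly strong positive association.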
diff --git a/docs/api/text/index.rst b/docs/api/text/index.rst
@@ -11,6 +11,7 @@ We currently have five text-specific visualizations implemented:
 -  :doc:`tsne`: plot similar documents closer together to discover clusters
 -  :doc:`umap_vis`: plot similar documents closer together to discover clusters
 -  :doc:`dispersion`: plot the dispersion of target words throughout a corpus
+-  :doc:`correlation`: plot the correlation between target words across the documents in a corpus
 -  :doc:`postag`: plot the counts of different parts-of-speech throughout a tagged corpus
 
 Note that the examples in this section require a corpus of text data, see :doc:`the hobbies corpus <../datasets/hobbies>` for a sample dataset.
@@ -21,6 +22,7 @@ Note that the examples in this section require a corpus of text data, see :doc:`
     from yellowbrick.text import TSNEVisualizer
     from yellowbrick.text import UMAPVisualizer
     from yellowbrick.text import DispersionPlot
+    from yellowbrick.text import WordCorrelationPlot
     from yellowbrick.text import PosTagVisualizer
 
     from sklearn.feature_extraction.text import TfidfVectorizer
@@ -33,4 +35,5 @@ Note that the examples in this section require a corpus of text data, see :doc:`
    tsne
    umap_vis
    dispersion
+   correlation
    postag
diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -3,6 +3,40 @@
 Changelog
 =========
 
+Version 1.5
+-----------
+
+* Tag: v1.5_
+* Deployed Sunday, August 21, 2022
+* Current Contributors: Stefanie Molin, Prema Roman, Sangam Swadik, David Gilbertson, Larry Gray, Benjamin Bengfort, @admo1, @charlesincharge, Uri Nussbaum, Patrick Deziel, Rebecca Bilbro
+
+Major
+   - Added ``WordCorrelationPlot`` Visualizer
+   - Built tests for using sklearn pipeline with visualizers
+   - Allowed Marker Style to be specified in Validation Curve Visualizer
+   - Fixed ``get_params`` for estimator wrapper to prevent ``AttributeError``
+   - Updated missing values visualizer to handle multiple data types and work on both numpy arrays and pandas data frames.
+   - Added pairwise distance metrics to scoring metrics in KElbowVisualizer
+Minor
+   - Pegged Numba to v0.55.2
+   - Updated Umap to v0.5.3
+   - Fixed Missing labels in classification report visualizer
+   - Updated Numpy to v1.22.0
+Documentation
+   - The Spanish language Yellowbrick docs are now live: https://www.scikit-yb.org/es/latest/
+   - Added Dropping curve documentation
+   - Added new example Notebook for Regression Visualizers
+   - Fixed Typo in PR section of getting started docs
+   - Fixed Typo in rank docs
+   - Updated docstring in kneed.py utility file
+   - Clarified how to run ‘make html’ in PR template
+Infrastructure
+   - Added ability to run linting Actions on PRs
+   - Implemented black code formatting as pre-commit hook
+
+.. _v1.5: https://github.com/DistrictDataLabs/yellowbrick/releases/tag/v1.5
+
+
 Version 1.4
 -----------