Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Python 3.4+, Selenium 3.3 and Firefox 52esr #143

Closed
wants to merge 32 commits into from

Conversation

zackw
Copy link
Contributor

@zackw zackw commented Mar 17, 2017

This PR addresses issues #88, #92, #93, and maybe #101.

Selenium 2.x does not support Python 3, so it's necessary to bump the Selenium and Firefox versions, and that was actually the hardest part of the change. MITMProxy has been completely ripped out, but I kept the proxy_queue around in case it is desirable to add some other HTTP proxy in the future (it'll just always be None and do nothing). The only other dependency that hadn't been updated was PyAMF, which I forked and pruned down to just enough for what OpenWPM needs.

This passes the internal test suite, but it probably needs quite a bit more testing; I expect there is lots of stuff the internal test suite does not cover. I am happy to continue working with you on this, but I need to get back to the project that this is in aid of, so I may be slow, and I'm not going to do any more testing myself, beyond what my own project involves. Y'all are welcome to add patches on top of this, however is convenient for you. I may add patches on top of this if I turn up more problems in my own testing. (More bytes/str/unicode mixup bugs are practically certain to exist.)

TravisCI report for my branch: https://travis-ci.org/zackw/OpenWPM

Some notes on specific concerns from the issues and/or bits of the patch that may not seem obvious:

  • Yes, Python 2 is still supported. I am tempted to start a further branch that only supports 3.6+ and uses asyncio instead of threads, as this would make it much easier to integrate with my own code, but that's probably at least another man-month of coding, and from a project management perspective I'm sure you don't want to take that plunge all at once. 😉
  • The OpenWPM extension seems to load and run just fine, despite the concerns about extension signatures and e10s in OpenWPM can't be installed with Firefox 48+ release version #92. (I did notice something about "extension @openwpm incompatible with e10s" in firefox startup spew.)
  • I had to work around some bugs in GeckoDriver and Selenium. The one that worries me the most is that it copies the profile, geckodriver issue #299 implies that it's supposed to delete its copy when it exits, and profile archiving relies on it not doing that. My long-term recommendation would actually be to work directly with the Marionette client library, which would eliminate both Selenium and GeckoDriver from the equation, but I expect that would be a lot of work and might break existing OpenWPM crawler scripts. The Marionette API resembles the Selenium/Python API but is not actually a subset or superset, I don't think.
  • Selenium 3 does not expose itself to page JS, so toggling disable_webdriver_self_id no longer has any effect. There are a number of vestiges of that kept around until we're sure we don't need to go back to Selenium 2.
  • Selenium 3 needs both geckodriver and firefox to be in $PATH, so what we do now is make sure the top-level firefox-bin directory is at the head of $PATH if it exists, and install.sh copies geckodriver in there. If it doesn't exist, we go ahead and try to run system-installed firefox/geckodriver instead.
  • I un-disabled Tracking Protection but I did not actually test it. With the removal of MITMProxy, perhaps "behind the scenes" network traffic is less of a problem now?
  • I replaced Adblock Plus with uBlock Origin and Ghostery with Disconnect, because that seems to be what privacy-conscious users are doing these days. These are also not being tested. I can't spare the time to develop tests for them right now.
  • There are a number of places where I rejiggered the code just to make it easier to read. I apologize if this makes the diff harder to read.

zackw added 30 commits March 9, 2017 10:17
PIL is not a dependency of OpenWPM proper, so it would be inappropriate
to list it in requirements.txt, but it _is_ used by one of the tests.
Kludge a direct installation command into .travis.yml.
This should get us 90% of the way to Python 3 support.
This module was completely removed in Python 3.
This may mean install.sh now needs to install something called 'GeckoDriver',
but let's see if we can get away without it.
 * Rationalize import ordering in some files.
 * Don't run nontrivial code at module scope when invoked as __main__.
 * If jpm is not available, but the .xpi exists, don't bomb out.
Since Selenium 3.3 requires a 'geckodriver' executable in
the PATH, put <root_dir>/firefox-bin in the PATH if it exists,
and rely on PATH search to find 'firefox'.
 * Replace Adblock Plus with uBlock Origin
   (which does not need precached filter lists)
 * Replace Ghostery with Disconnect (ditto)
 * Update HTTPS Everywhere to latest version
We _might_ be down to just HTTP instrumentation problems at this point.
The behavior of `open(path, "a+")` differs between Python 2 and Python 3.
In the latter, it will try to seek to the end, and if this fails (e.g. if
`path` is a pipe) it will throw an exception.  To work around this we
have to monkey-patch selenium.webdriver.firefox.service.Service.
sqlite3 fetchall() has always returned an array of tuples even when
the query returns a single row; Python 2's sloppy cross-type comparisons
let us get away with it.
@englehardt
Copy link
Collaborator

Thanks so much for taking this on -- this has been a looming TODO I haven't been able to get started on. It may be a little while before I am able to review and further test your changes, but I just wanted to jump in now to say I appreciate the contribution.

If you run into any issues or have any questions as you continue work on your project, please feel free to reach out.

@cgoldberg
Copy link

Selenium 2.x does not support Python 3

FWIW, Selenium 2.x has supported Python 3 for several years.

@zackw
Copy link
Contributor Author

zackw commented Apr 24, 2017

@cgoldberg

FWIW, Selenium 2.x has supported Python 3 for several years.

The version that OpenWPM had pinned when I started this project (2.53) didn't. :-)

@englehardt
Copy link
Collaborator

@zackw I've been testing this over the past month or so and added a bunch of patches on top of your changes. Thanks again for the contributions! I'm going to close this in favor of #152.

@englehardt englehardt closed this Oct 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants