Web scraper to scrape COLIN-UI and download all the filing outputs of legacy businesses and transfer them into LEAR
RFC
design for LEAR side
- fill .env in root directory, configMap.yaml under scripts/deployments, and tnsnames.ora under config in root directory
- create `test-outputs` folder in the root directory if not already present
- linux environment to clone and run app, e.g. WSL2 with Ubuntu 20.04 installed
- Gov VPN installed and running to connect to oracle DB
- minikube for local kubernetes deployment
- docker desktop installed and enabled in WSL2, to manage containers
- run `make setup` to set up venv with dev requirements
- DPI-1047: Cannot locate a 64-bit Oracle Client library: Oracle instant client is not installed. You will encounter this if you run the app outside of docker compose without the Oracle instant client installed on your machine
- fixed by running app through docker compose
- ORA-12545: Connect failed because target host or object does not exist: usually because VPN is not running when running app or deployment
- fixed by turning on VPN and rerunning app
- sometimes selenium-grid may throw a bind(): failed error
- usually resolved by restarting computer
- set command_executor in scraper.py to http://selenium:4444/wd/hub
- run `make dev` in root directory; after the first run, this only needs to be rerun if the dockerfile is changed
- for subsequent runs you can use `docker compose up`
- colin-scraper-app will usually crash on startup since it doesn't wait for a chrome node to be set up by selenium grid.
- a workaround is to go into docker desktop and restart the container
- you should now see 2 dates followed by business numbers being logged
- to input dates
- update DATE_RANGE_START, DATE_RANGE_END, and FINAL_END_DATE env vars
- run `docker compose up`
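
The startup crash noted above (the app not waiting for a chrome node) could in principle be avoided by polling the grid before creating the webdriver. The following is only a sketch of that idea, not code from this repo: it assumes the grid exposes the standard WebDriver status endpoint under the hub URL (for example `http://selenium:4444/wd/hub/status`) and that the JSON response carries a `value.ready` flag. The function names `grid_is_ready` and `wait_for_grid` are hypothetical.

```python
import json
import time
import urllib.request


def grid_is_ready(status_url, timeout=5.0):
    """Return True only if the status endpoint reports value.ready == true."""
    try:
        with urllib.request.urlopen(status_url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Grid not reachable yet (URLError is an OSError), or the body was not JSON.
        return False
    value = payload.get("value") if isinstance(payload, dict) else None
    return isinstance(value, dict) and value.get("ready") is True


def wait_for_grid(status_url, attempts=30, delay=2.0,
                  probe=grid_is_ready, sleep=time.sleep):
    """Poll until the grid is ready; return False if all attempts are used up."""
    for _ in range(attempts):
        if probe(status_url):
            return True
        sleep(delay)
    return False
```

If something like this were called at the top of scraper.py, the container would wait for the grid instead of needing a manual restart from docker desktop. The attempt count and delay are arbitrary defaults.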
- start kubernetes cluster, e.g. `minikube start`
- run `eval $(minikube -p minikube docker-env)`
- set command_executor to http://selenium-hub:4444/wd/hub
- run `make local-deploy`; this sets up local selenium-hub and scraper deployments
- warnings about the oracle-instantclient are normal and expected here
- use kubectl commands to explore deployment
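
Both setups above require editing command_executor by hand. One possible way to avoid that is to read the hub URL from an environment variable. This is a sketch only: the variable name `SELENIUM_REMOTE_URL` and the helper names are hypothetical and do not exist in this repo, and the default is assumed to be the docker compose URL.

```python
import os

# The two hub URLs named in this README.
COMPOSE_HUB = "http://selenium:4444/wd/hub"
K8S_HUB = "http://selenium-hub:4444/wd/hub"


def resolve_command_executor(env=None):
    """Pick the hub URL from SELENIUM_REMOTE_URL (hypothetical), else the compose URL."""
    env = os.environ if env is None else env
    return env.get("SELENIUM_REMOTE_URL", COMPOSE_HUB)


def make_driver(env=None):
    """Create a remote Chrome session against the resolved hub URL."""
    # Imported here so the URL helper can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    return webdriver.Remote(
        command_executor=resolve_command_executor(env),
        options=options,
    )
```

With this in place, the kubernetes deployment would set the variable to the selenium-hub URL (for instance through configMap.yaml), while docker compose runs would fall back to the default.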
This application connects to COLIN's Oracle DB to query the events table for corp nums and filing events within a specified time interval.
It then navigates through COLIN UI and searches each queried corp num using selenium.
For each corp searched it:
- harvests all hrefs for outputs attached to filings done within the specified time interval using BeautifulSoup (BS)
- makes asynchronous download requests for all harvested hrefs.
- these requests return PDF data, which is stored in memory
- this data is temporarily written into PDF files
a. filing event id, date filed, and name are all available alongside the PDF data
b. in the end we want this data to be sent to LEAR's Doc Store
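
The harvest-then-download steps above can be sketched roughly as follows. This is an illustration, not the app's code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, skips the filtering by filing date, and takes the downloader as a parameter. `fake_fetch`, the `output_<n>.pdf` naming, and the example URLs are all hypothetical.

```python
import asyncio
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def harvest_hrefs(page_html, base_url):
    """Return absolute URLs for every link found in the page."""
    parser = HrefCollector()
    parser.feed(page_html)
    return [urljoin(base_url, h) for h in parser.hrefs]


async def download_all(hrefs, fetch, out_dir, max_concurrent=5):
    """Fetch every href concurrently and write each result to a PDF file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    limit = asyncio.Semaphore(max_concurrent)

    async def one(index, href):
        async with limit:
            data = await fetch(href)  # PDF bytes held in memory
        path = out_dir / f"output_{index}.pdf"
        path.write_bytes(data)  # temporary file on disk
        return path

    return await asyncio.gather(*(one(i, h) for i, h in enumerate(hrefs)))


async def fake_fetch(href):
    """Hypothetical downloader; a real one would issue an HTTP GET."""
    await asyncio.sleep(0)
    return b"%PDF-1.4 placeholder for " + href.encode()
```

A run would look like `asyncio.run(download_all(harvest_hrefs(html, base_url), fake_fetch, "test-outputs"))`. The semaphore caps how many requests are in flight at once; whether the real app limits concurrency is not stated here.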
After all corps with filing events have been visited, the time interval is stepped up by a year.
This approach to gathering outputs ensures that we aren't redownloading any outputs and we aren't revisiting corps unnecessarily.
It also catches any new filing events made during or after the bot runs, because we're querying through filing events. When a new filing is made, we can start the bot and tell it to run from when it last ran to the present day; it will grab all the filing events in that time interval and download all the associated outputs.
Furthermore, scans of old paper filings into digital filings are also caught, since they create filing events with a timestamp on the day they were scanned. The bot can simply be run again to catch those.
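
The yearly stepping of the time interval can be pictured with a small sketch. It rests on assumptions not confirmed by this README: that both ends of the window move forward by one year per step, that the initial window spans one year (otherwise gaps would appear), and that the last window is clamped to FINAL_END_DATE. The function names are hypothetical.

```python
from datetime import date


def add_year(d):
    """Return the same calendar day one year later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def yearly_windows(start, end, final_end):
    """Yield (start, end) windows, stepping both bounds by a year until final_end."""
    while start < final_end:
        # Clamp the last window so it never runs past the final end date.
        yield start, min(end, final_end)
        start, end = add_year(start), add_year(end)
```

The three arguments correspond to DATE_RANGE_START, DATE_RANGE_END, and FINAL_END_DATE once parsed into dates, for example with `date.fromisoformat` if the values are ISO formatted (the expected format of the env vars is not specified here). Each yielded pair would drive one query against the events table.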