Overview

A web scraper that crawls COLIN UI, downloads all the filing outputs of legacy businesses, and transfers them into LEAR.
RFC
design for LEAR side

Prerequisites

fill in .env in the root directory, configMap.yaml under scripts/deployments, and tnsnames.ora under config in the root directory
create a test-outputs folder in the root directory if it is not already present
Linux environment to clone and run the app, e.g. WSL2 with Ubuntu 20.04 installed
Gov VPN installed and running, to connect to the Oracle DB
minikube for local Kubernetes deployment
Docker Desktop installed and enabled in WSL2, to manage containers
run make setup to set up a venv with the dev requirements

Common Errors

DPI-1047: Cannot locate a 64-bit Oracle Client library: the Oracle Instant Client is not installed. You will hit this if you run the app outside of docker compose and don't have the Oracle Instant Client installed on your machine
- fixed by running the app through docker compose
ORA-12545: Connect failed because target host or object does not exist: usually means the VPN was not running when the app or deployment was started
- fixed by turning on the VPN and rerunning the app
selenium-grid may sometimes throw a bind(): failed error
- usually resolved by restarting the computer

Running the app

set command_executor in scraper.py to http://selenium:4444/wd/hub
run make dev in the root directory. After the first run, this only needs to be rerun if the Dockerfile is changed
- for subsequent runs you can use docker compose up
colin-scraper-app will usually crash on startup, since it doesn't wait for Selenium Grid to set up a Chrome node
- a workaround is to go into Docker Desktop and restart the container
you should now see two dates followed by business numbers being logged
to input dates:
- update the DATE_RANGE_START, DATE_RANGE_END, and FINAL_END_DATE env vars
- run docker compose up
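
The startup crash noted above comes from the scraper connecting before the grid has a Chrome node. One way to avoid the manual restart could be to poll the hub's status endpoint before creating the driver. The following is a minimal sketch, not code from this repo: wait_for_grid and fetch_grid_status are hypothetical helpers, and the status URL and the value.ready field are assumptions about how the grid in docker compose reports readiness.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_grid_status(url: str) -> dict:
    """Fetch and decode the JSON status document from the Selenium hub."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_for_grid(
    status_url: str = "http://selenium:4444/wd/hub/status",  # assumed endpoint
    attempts: int = 30,
    delay_seconds: float = 2.0,
    fetch: Callable[[str], dict] = fetch_grid_status,
) -> bool:
    """Poll the hub until it reports ready; return False if it never does."""
    for _ in range(attempts):
        try:
            status = fetch(status_url)
            # Assumption: readiness is reported under value.ready.
            if status.get("value", {}).get("ready"):
                return True
        except (OSError, ValueError):
            # Hub not reachable yet, or the response was not valid JSON.
            pass
        time.sleep(delay_seconds)
    return False
```

Calling wait_for_grid() at the top of the scraper, and exiting with a clear message when it returns False, would replace the restart workaround with an explicit wait.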

Kubernetes Deployment

start a Kubernetes cluster, e.g. minikube start
run eval $(minikube -p minikube docker-env)
set command_executor to http://selenium-hub:4444/wd/hub
run make local-deploy. This sets up local selenium-hub and scraper deployments
warnings about the oracle-instantclient are normal and expected here
use kubectl commands to explore the deployment

Implementation Details

This application connects to COLIN's Oracle DB and queries the events table for corp nums and filing events within a specified time interval.
It then navigates through COLIN UI and searches for each queried corp num using Selenium.
For each corp searched it:

harvests all hrefs for outputs attached to filings done within the specified time interval, using BeautifulSoup (BS)
makes asynchronous download requests for all harvested hrefs
these requests return PDF data, which is stored in memory
this data is temporarily written into PDF files
a. filing event id, date filed, and name are all available alongside the PDF data
b. in the end we want this data to be sent to LEAR's Doc Store
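
The asynchronous download step above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the repo's implementation: download_all is a hypothetical name, fetch_pdf stands in for whatever coroutine actually performs the HTTP request, and the concurrency cap is an added safeguard that the steps above do not mention.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def download_all(
    hrefs: List[str],
    fetch_pdf: Callable[[str], Awaitable[bytes]],  # hypothetical request coroutine
    max_concurrent: int = 5,  # assumed cap, to avoid flooding COLIN UI
) -> Dict[str, bytes]:
    """Download every href concurrently and return the PDF bytes keyed by href."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def fetch_one(href: str) -> bytes:
        # Hold a semaphore slot for the duration of one request.
        async with limiter:
            return await fetch_pdf(href)

    # gather preserves input order, so payloads line up with hrefs.
    payloads = await asyncio.gather(*(fetch_one(href) for href in hrefs))
    return dict(zip(hrefs, payloads))
```

The returned mapping keeps the PDF data in memory, so the caller can write it to files in test-outputs or, eventually, forward it to LEAR's Doc Store.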

After all corps with filing events have been visited, the time interval is stepped up by a year.
This approach to gathering outputs ensures that we aren't redownloading any outputs and aren't revisiting corps unnecessarily.
It also catches any new filing events made during or after a bot run, because we query by filing event. When a new filing is made, we can start the bot and tell it to run from when it last ran to the present day, and it will grab all the filing events in that time interval and download all the associated outputs.
Furthermore, scans of old paper filings into digital filings are also caught, since they create filing events timestamped on the day they were scanned. The bot can simply be run again to catch those.
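
The year-by-year stepping described above can be pictured as a generator of non-overlapping date windows. This is a simplified sketch, not the repo's code: yearly_windows and add_one_year are hypothetical names, and it assumes the windows are half-open and driven only by a start date and a final end date, whereas the app itself is configured through DATE_RANGE_START, DATE_RANGE_END, and FINAL_END_DATE.

```python
from datetime import date
from typing import Iterator, Tuple


def add_one_year(day: date) -> date:
    """Return the same calendar day one year later (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year + 1)
    except ValueError:
        return day.replace(year=day.year + 1, day=28)


def yearly_windows(start: date, final_end: date) -> Iterator[Tuple[date, date]]:
    """Yield consecutive [window_start, window_end) intervals up to final_end."""
    window_start = start
    while window_start < final_end:
        # The last window is clipped so it never runs past final_end.
        window_end = min(add_one_year(window_start), final_end)
        yield window_start, window_end
        window_start = window_end
```

Because each window begins exactly where the previous one ended, no filing event falls into two windows, which is what keeps outputs from being downloaded twice. A rerun from the last run date to today is then just another call with a later start.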

MatthewCai2002/colin-scraper
