Scraper cleanup updates: added type annotation, logging, doc strings, and error handling #141
Conversation
src/scraper/__init__.py
Outdated
    end_date: Optional[str] = None,
    court_calendar_link_text: Optional[str] = None,
    case_number: Optional[str] = None
) -> Tuple[int, str, str, str, Optional[str]]:
So this tuple type totally works and is very usable, but you might consider implementing a config class to hold all of these in a named way. That might look something like this:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ScraperConfig:
    ms_wait: Optional[int] = None
    start_date: str | None = None  # this is an alternate way to do the type btw, some people prefer it. Optional[T] == T | None
    end_date: str | None = None
    ...

Then you can use it like my_config = ScraperConfig(ms_wait=100, end_date='2024-01-01', ...)
and access the fields like my_config.ms_wait
Sick. Okay. I'll figure this out and implement it in a separate PR for the scraper module.
Okay I tried implementing this on the set_defaults piece and got lost in the sauce. I'll have to ask you about this next time we chat.
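For reference, here is one way the dataclass suggestion could be wired into a set_defaults-style helper. The field names, fallback values, and the set_defaults signature below are assumptions for illustration, not the module's actual code:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ScraperConfig:
    # Field names mirror the keyword arguments discussed above; defaults are illustrative.
    ms_wait: Optional[int] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    case_number: Optional[str] = None


def set_defaults(config: ScraperConfig) -> ScraperConfig:
    # Fill in any unset fields from a fallback config (hypothetical fallback values).
    fallbacks = ScraperConfig(ms_wait=200, start_date="2024-07-01", end_date="2024-07-01")
    for f in fields(config):
        if getattr(config, f.name) is None:
            setattr(config, f.name, getattr(fallbacks, f.name))
    return config
```

Because every setting lives on one object, the defaulting loop replaces a long chain of `if x is None:` checks.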
src/scraper/__init__.py
Outdated
    TypeError: If the provided county name is not a string.
"""
if not isinstance(county, str):
    raise TypeError("The county name must be a string.")
So you can definitely do this kind of type guarding, but it's not commonly used unless you expect people to be using the code from outside the module, and you have a strong reason you need a specific type. If you run mypy on the code during build (I can set up a build pipeline btw) it will catch any silly mistakes like passing an int in here, with no runtime check needed. That said, feel free to keep them if you like them
Sounds good! I'll remove it to keep things cleaner.
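To illustrate the trade-off being described: once the parameter is annotated, a static checker like mypy rejects a bad call before the code runs, so the runtime isinstance guard is redundant. The function name here is a stand-in, not the module's real signature:

```python
def scrape_county(county: str) -> str:
    # No runtime isinstance check needed: the annotation documents the contract,
    # and mypy will flag calls like scrape_county(42) at check time with an
    # "incompatible type" error, before the code ever executes.
    return county.lower()


# Uncommenting the line below would pass at runtime until .lower() fails,
# but mypy catches it statically:
# scrape_county(42)
```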
Creates and configures a requests session for interacting with web pages.

This method sets up a `requests.Session` with SSL verification disabled and suppresses
related warnings.
Let's make the ssl verify an optional parameter, and only disable warnings if the user requests no verification. I believe we can turn verification on, and it's a good practice to do so, but we want to keep the ability to turn it off again if needed.
I'll turn it back on and test and see what happens. Plus I'll make it an optional parameter
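A minimal sketch of what's being asked for here: verification on by default, warnings suppressed only when the caller explicitly opts out. The helper name and default are placeholders, not the module's actual API:

```python
import requests
import urllib3


def create_session(verify_ssl: bool = True) -> requests.Session:
    # Verification stays on by default; callers can opt out explicitly.
    session = requests.Session()
    session.verify = verify_ssl
    if not verify_ssl:
        # Only silence InsecureRequestWarning when the user asked to skip verification.
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session
```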
src/scraper/__init__.py
Outdated
Raises:
    OSError: If there is an error creating the directories.
"""
case_html_path = os.path.join(os.path.dirname(__file__), "..", "..", "data", county, "case_html")
In general, using __file__ to find other files is not recommended. The reason is that if someone wants to import this code and use it, they have to place the input files in some weird directory where your code is, rather than somewhere convenient for them. I would recommend making the path to the data folder an argument to this module, so that people can just supply a path that works for them. You can default to something like ./data, which will be relative to where you run python from, rather than where this source file is.
Sounds good! I'll make this case_html_path a parameter you can pass in, and it will default to this directory.
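A sketch of the suggested change, with data_dir as an assumed parameter name and ./data as the default the reviewer proposed:

```python
import os


def make_case_html_path(county: str, data_dir: str = "./data") -> str:
    # The data directory is caller-supplied, defaulting to ./data relative to the
    # current working directory rather than to this source file's location.
    path = os.path.join(data_dir, county, "case_html")
    os.makedirs(path, exist_ok=True)  # raises OSError on failure, per the docstring
    return path
```

Importers of the module can now keep their data anywhere and pass the path in, while `python`-from-the-repo-root keeps working unchanged.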
src/scraper/__init__.py
Outdated
Raises:
    Exception: If the county is not found in the CSV file or if required data is missing, an exception is raised
    and logged.
There's a standard format for this kind of docstring, the one I'm familiar with would look like this. The advantage of using this format is that we can run https://www.sphinx-doc.org/en/master/ to make a docs site for our code automatically, and it will do lots of nice things for you if you follow a format it knows.
"""
One line summary
Longer description body, lots of words words
words, multiline potentially.
:param county: the country to parse. (note: no type needed, that's in the type annotation already so the tool will grab it)
:param logger: the logger to use
:returns: a tuple of ... (same thing, no type needed unless it makes the language smoother)
:raises Exception: on these conditions (prefer a subtype of Exception, define your own if you need to, so people have something specific to catch)
"""
Thanks! I'll update all of the doc strings to match this new format.
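As a concrete illustration of that format applied to a function, including the "define your own exception subtype" advice (the function, exception, and data below are hypothetical, not the module's real code):

```python
class CountyNotFoundError(Exception):
    """Raised when the requested county is missing from the counties table."""


def lookup_county(county: str) -> dict:
    """
    Look up scraper settings for a county.

    Reads the counties table and returns the row matching the
    given county name, compared case-insensitively.

    :param county: the county name to look up
    :returns: a dict of settings for the county
    :raises CountyNotFoundError: if the county is not in the table
    """
    counties = {"hays": {"portal_base_url": "https://example.com"}}  # stand-in for the CSV data
    try:
        return counties[county.lower()]
    except KeyError:
        raise CountyNotFoundError(county)
```

With this layout, Sphinx's autodoc can render the summary, parameter list, return value, and raised exception as structured documentation instead of free text.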
if not search_page_id:
    write_debug_and_quit(
        verification_text="Court Calendar link",
        page_text=main_page_html,
        logger=logger,
    )
search_url = base_url + "Search.aspx?ID=" + search_page_id
raise ValueError("Court Calendar link not found on the main page.")
Just highlighting that this is not actually using a default value if one isn't found, which isn't consistent with the docs
Oof. I'll make sure to fix this. Good catch!
Actually wait. Do you mean not using a default search_url or a default base_url?
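One way to resolve the inconsistency being flagged is to decide about the missing ID before the URL is built, either falling back to a documented default or raising. The names follow the diff above; the default-URL behavior is an assumption about intent, not the agreed fix:

```python
from typing import Optional


def build_search_url(base_url: str, search_page_id: Optional[str],
                     default_search_url: Optional[str] = None) -> str:
    # Handle a missing ID *before* constructing the URL, so search_page_id
    # is never None when it is concatenated.
    if not search_page_id:
        if default_search_url is not None:
            return default_search_url  # the fallback the docs promise
        raise ValueError("Court Calendar link not found on the main page.")
    return base_url + "Search.aspx?ID=" + search_page_id
```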
date_string = datetime.strftime(date, "%m/%d/%Y")
# loop through each judicial officer

for date in (start_date + timedelta(n) for n in range((end_date - start_date).days + 1)):
So this kind of nested iterator is confusing to me as a reader, I had to think for a minute to grok this. I think this might be more readable:

for n in range((end_date - start_date).days + 1):
    date = start_date + timedelta(n)
    ...
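The rewritten loop, made runnable as a quick check (the dates here are arbitrary examples, not values from the scraper):

```python
from datetime import date, timedelta

start_date = date(2024, 1, 1)
end_date = date(2024, 1, 3)

dates = []
for n in range((end_date - start_date).days + 1):
    current = start_date + timedelta(n)
    dates.append(current.strftime("%m/%d/%Y"))

# Produces one string per day, inclusive of both endpoints.
```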
src/scraper/__init__.py
Outdated
)

scraper_instance, scraper_function = self.get_class_and_method(county, logger)
if scraper_instance and scraper_function:
Since get_class_and_method can't return None anymore, you can just assume the results are there and remove this if and the error log. The exception will bubble up if it happens.
great! I'll remove it.
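The simplification being agreed to, sketched with hypothetical stand-ins for the scraper class and lookup helper:

```python
import logging

logger = logging.getLogger(__name__)


class Scraper:
    def scrape(self, county: str) -> str:
        return f"scraped {county}"


def get_class_and_method(county: str, logger: logging.Logger):
    # Always returns a usable (instance, method) pair now; on failure it
    # raises instead of returning None, so callers need no guard.
    instance = Scraper()
    return instance, instance.scrape


# Before: `if scraper_instance and scraper_function:` plus an else-branch error log.
# After: call directly and let any exception from get_class_and_method bubble up.
scraper_instance, scraper_function = get_class_and_method("hays", logger)
result = scraper_function("hays")
```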
Co-authored-by: Matt Allen <matt_allen@utexas.edu>
@Matt343 Hi there Matt! Any more changes or thoughts before approving this?
Making the scraper clearer and easier to maintain with documentation, logging, error handling, and type annotations.
It makes the code a lot longer, though.