Scraper cleanup updates: added type annotation, logging, doc strings, and error handling #141
Conversation
src/scraper/__init__.py
Outdated
    end_date: Optional[str] = None,
    court_calendar_link_text: Optional[str] = None,
    case_number: Optional[str] = None
) -> Tuple[int, str, str, str, Optional[str]]:
So this tuple type totally works and is very usable, but you might consider implementing a config class to hold all of these in a named way. That might look something like this:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ScraperConfig:
    ms_wait: Optional[int] = None
    start_date: str | None = None  # this is an alternate way to do the type btw, some people prefer it. Optional[T] == T | None
    end_date: str | None = None
    ...

Then you can use it like my_config = ScraperConfig(ms_wait=100, end_date='2024-01-01', ...)
and access the fields like my_config.ms_wait
Sick. Okay. I'll figure this out and implement it in a separate PR for the scraper module.
Okay I tried implementing this on the set_defaults piece and got lost in the sauce. I'll have to ask you about this next time we chat.
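For reference, here is one way the dataclass suggestion could be wired into a set_defaults-style helper. The field names, fallback values, and the set_defaults signature below are assumptions for illustration, not the module's actual code:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ScraperConfig:
    # Field names mirror the keyword arguments discussed above; defaults are illustrative.
    ms_wait: Optional[int] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    case_number: Optional[str] = None


def set_defaults(config: ScraperConfig) -> ScraperConfig:
    # Fill in any unset fields from a fallback config (hypothetical fallback values).
    fallbacks = ScraperConfig(ms_wait=200, start_date="2024-07-01", end_date="2024-07-01")
    for f in fields(config):
        if getattr(config, f.name) is None:
            setattr(config, f.name, getattr(fallbacks, f.name))
    return config
```

Because every setting lives on one object, the defaulting loop replaces a long chain of `if x is None:` checks.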
src/scraper/__init__.py
Outdated
    TypeError: If the provided county name is not a string.
"""
if not isinstance(county, str):
    raise TypeError("The county name must be a string.")
So you can definitely do this kind of type guarding, but it's not commonly used unless you expect people to be using the code from outside the module, and you have a strong reason you need a specific type. If you run mypy on the code during build (I can set up a build pipeline btw) it will catch any silly mistakes like passing an int in here, with no runtime check needed. That said, feel free to keep them if you like them
Sounds good! I'll remove it to keep things cleaner.
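To illustrate the trade-off being described: once the parameter is annotated, a static checker like mypy rejects a bad call before the code runs, so the runtime isinstance guard is redundant. The function name here is a stand-in, not the module's real signature:

```python
def scrape_county(county: str) -> str:
    # No runtime isinstance check needed: the annotation documents the contract,
    # and mypy will flag calls like scrape_county(42) at check time with an
    # "incompatible type" error, before the code ever executes.
    return county.lower()


# Uncommenting the line below would pass at runtime until .lower() fails,
# but mypy catches it statically:
# scrape_county(42)
```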
Creates and configures a requests session for interacting with web pages.

This method sets up a `requests.Session` with SSL verification disabled and suppresses
related warnings.
Let's make the ssl verify an optional parameter, and only disable warnings if the user requests no verification. I believe we can turn verification on, and it's a good practice to do so, but we want to keep the ability to turn it off again if needed.
I'll turn it back on and test and see what happens. Plus I'll make it an optional parameter
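A minimal sketch of what's being asked for here: verification on by default, warnings suppressed only when the caller explicitly opts out. The helper name and default are placeholders, not the module's actual API:

```python
import requests
import urllib3


def create_session(verify_ssl: bool = True) -> requests.Session:
    # Verification stays on by default; callers can opt out explicitly.
    session = requests.Session()
    session.verify = verify_ssl
    if not verify_ssl:
        # Only silence InsecureRequestWarning when the user asked to skip verification.
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session
```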
src/scraper/__init__.py
Outdated
Raises:
    OSError: If there is an error creating the directories.
"""
case_html_path = os.path.join(os.path.dirname(__file__), "..", "..", "data", county, "case_html")
In general, using __file__ to find other files is not recommended. The reason is that if someone wants to import this code and use it, they have to place the input files in some weird directory where your code is, rather than somewhere convenient for them. I would recommend making the path to the data folder an argument to this module, so that people can just supply a path that works for them. You can default to something like ./data, which will be relative to where you run python from, rather than where this source file is.
Sounds good! I'll make this case_html_path a parameter you can pass in, and it will default to this directory.
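A sketch of the suggested change, with data_dir as an assumed parameter name and ./data as the default the reviewer proposed:

```python
import os


def make_case_html_path(county: str, data_dir: str = "./data") -> str:
    # The data directory is caller-supplied, defaulting to ./data relative to the
    # current working directory rather than to this source file's location.
    path = os.path.join(data_dir, county, "case_html")
    os.makedirs(path, exist_ok=True)  # raises OSError on failure, per the docstring
    return path
```

Importers of the module can now keep their data anywhere and pass the path in, while `python`-from-the-repo-root keeps working unchanged.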
src/scraper/__init__.py
Outdated
Raises:
    Exception: If the county is not found in the CSV file or if required data is missing, an exception is raised
    and logged.
There's a standard format for this kind of docstring, the one I'm familiar with would look like this. The advantage of using this format is that we can run https://www.sphinx-doc.org/en/master/ to make a docs site for our code automatically, and it will do lots of nice things for you if you follow a format it knows.
"""
One line summary
Longer description body, lots of words words
words, multiline potentially.
:param county: the country to parse. (note: no type needed, that's in the type annotation already so the tool will grab it)
:param logger: the logger to use
:returns: a tuple of ... (same thing, no type needed unless it makes the language smoother)
:raises Exception: on these conditions (prefer a subtype of Exception, define your own if you need to, so people have something specific to catch)
"""
Thanks! I'll update all of the doc strings to match this new format.
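As a concrete illustration of that format applied to a function, including the "define your own exception subtype" advice (the function, exception, and data below are hypothetical, not the module's real code):

```python
class CountyNotFoundError(Exception):
    """Raised when the requested county is missing from the counties table."""


def lookup_county(county: str) -> dict:
    """
    Look up scraper settings for a county.

    Reads the counties table and returns the row matching the
    given county name, compared case-insensitively.

    :param county: the county name to look up
    :returns: a dict of settings for the county
    :raises CountyNotFoundError: if the county is not in the table
    """
    counties = {"hays": {"portal_base_url": "https://example.com"}}  # stand-in for the CSV data
    try:
        return counties[county.lower()]
    except KeyError:
        raise CountyNotFoundError(county)
```

With this layout, Sphinx's autodoc can render the summary, parameter list, return value, and raised exception as structured documentation instead of free text.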
if not search_page_id:
    write_debug_and_quit(
        verification_text="Court Calendar link",
        page_text=main_page_html,
        logger=logger,
    )
search_url = base_url + "Search.aspx?ID=" + search_page_id
raise ValueError("Court Calendar link not found on the main page.")
Just highlighting that this is not actually using a default value if one isn't found, which isn't consistent with the docs
Oof. I'll make sure to fix this. Good catch!
Actually wait. Do you mean not using a default search_url or a default base_url?
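One way to resolve the inconsistency being flagged is to decide about the missing ID before the URL is built, either falling back to a documented default or raising. The names follow the diff above; the default-URL behavior is an assumption about intent, not the agreed fix:

```python
from typing import Optional


def build_search_url(base_url: str, search_page_id: Optional[str],
                     default_search_url: Optional[str] = None) -> str:
    # Handle a missing ID *before* constructing the URL, so search_page_id
    # is never None when it is concatenated.
    if not search_page_id:
        if default_search_url is not None:
            return default_search_url  # the fallback the docs promise
        raise ValueError("Court Calendar link not found on the main page.")
    return base_url + "Search.aspx?ID=" + search_page_id
```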
date_string = datetime.strftime(date, "%m/%d/%Y")
# loop through each judicial officer

for date in (start_date + timedelta(n) for n in range((end_date - start_date).days + 1)):
So this kind of nested iterator is confusing to me as a reader, I had to think for a minute to grok this. I think this might be more readable:

for n in range((end_date - start_date).days + 1):
    date = start_date + timedelta(n)
    ...
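The rewritten loop, made runnable as a quick check (the dates here are arbitrary examples, not values from the scraper):

```python
from datetime import date, timedelta

start_date = date(2024, 1, 1)
end_date = date(2024, 1, 3)

dates = []
for n in range((end_date - start_date).days + 1):
    current = start_date + timedelta(n)
    dates.append(current.strftime("%m/%d/%Y"))

# Produces one string per day, inclusive of both endpoints.
```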
src/scraper/__init__.py
Outdated
)

scraper_instance, scraper_function = self.get_class_and_method(county, logger)
if scraper_instance and scraper_function:
Since get_class_and_method can't return None anymore, you can just assume the results are there and remove this if and the error log. The exception will bubble up if it happens.
great! I'll remove it.
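The simplification being agreed to, sketched with hypothetical stand-ins for the scraper class and lookup helper:

```python
import logging

logger = logging.getLogger(__name__)


class Scraper:
    def scrape(self, county: str) -> str:
        return f"scraped {county}"


def get_class_and_method(county: str, logger: logging.Logger):
    # Always returns a usable (instance, method) pair now; on failure it
    # raises instead of returning None, so callers need no guard.
    instance = Scraper()
    return instance, instance.scrape


# Before: `if scraper_instance and scraper_function:` plus an else-branch error log.
# After: call directly and let any exception from get_class_and_method bubble up.
scraper_instance, scraper_function = get_class_and_method("hays", logger)
result = scraper_function("hays")
```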
Co-authored-by: Matt Allen <matt_allen@utexas.edu>
@Matt343 Hi there Matt! Any more changes or thoughts before approving this?
Making the scraper clearer and easier to maintain with documentation, logging, error handling, and type annotations.
It makes the code a lot longer, though.