
registerguard/dt-story-out


Python project to do the following:

Preliminary research

Gallery subcats: 32058824,32067956,32058769,32058823,32058773,32058730,32058735,32058827,32058745,31994432,32058749,32058736,32058195,32058817,32068003,32058784,32058756

Video subcats: 31994433,32058759,32058196,32003307,32058816,32042463,32042459,32058748,31994425,32058825,32042583,32042464,32058820,32058774,32042460,32003311,32058826,32058162

Between 337 and 360 stories are slotted in the gallery subcategories. My guess is that 337 is the more accurate value; the 360 figure could include stories that are not officially published.

  • 337 is the number returned by the json API
  • 360 is the number returned by a db query: SELECT storyId FROM dbo.Story WHERE Story.subCategoryId in (32058824,32067956,32058769,32058823,32058773,32058730,32058735,32058827,32058745,31994432,32058749,32058736,32058195,32058817,32068003,32058784,32058756)
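
One way to check the guess about the 23 extra stories is to compare the two ID lists directly. The following is a minimal sketch, assuming the story IDs from the JSON API and from the db query have already been collected into two Python lists; the function name `compare_story_ids` and the sample IDs are made up for illustration.

```python
def compare_story_ids(api_ids, db_ids):
    """Return (only_in_db, only_in_api) as sorted lists of story IDs."""
    api_set = set(api_ids)
    db_set = set(db_ids)
    return sorted(db_set - api_set), sorted(api_set - db_set)


# Made-up IDs for illustration only; real values would come from the
# JSON API response and the dbo.Story query.
api_ids = [101, 102, 103]
db_ids = [101, 102, 103, 104, 105]

only_in_db, only_in_api = compare_story_ids(api_ids, db_ids)
print(only_in_db)   # [104, 105]
print(only_in_api)  # []
```

Stories that show up only on the db side would be the candidates to check for an unpublished status.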

Side note: When you do the JSON query, there is an error returned.

Wrote out first CSV file on 3/14.

Story data structure from DT API looks like this (clean and raw):

  • byline
  • category
  • catid (category id)
  • count (total stories returned)
  • deck
  • headline
  • path (full URL)
  • published (publish date, YYYY-MM-DD HH:MM:SS)
  • server (???)
  • author
  • excerpt
  • sspid
  • image (DT 990 file)
[{'byline': 'The Register-Guard Staff',
  'category': 'Sports',
  'catid': '32058784',
  'count': 1,
  'deck': 'From Willamette High to the NCAA Tournament, half a decadeof Lexi Bando in pictures',
  'headline': 'The Bando bandwagon',
  'path': 'http://registerguard.com/rg/photo/36550000-321/the-bando-bandwagon.html.csp',
  'published': '2018-03-14 14:30:01',
  'server': '',
  'author': 'The Register-Guard Staff',
  'excerpt': '',
  'sspid': '679',
  'image': 'http://registerguard.com/csp/cms/sites/dt.common.streams.StreamServer.cls?STREAMOID=I6uQtLL7vX76Kmq_8xB8Y8$daE2N3K4ZzOUsqbU5sYuSguc0JOxw0CVWS_uWjGHFWCsjLu883Ygn4B49Lvm9bPe2QeMKQdVeZmXF$9l$4uCZ8QDXhaHEp3rvzXRJFdy0KqPHLoMevcTLo3h8xh70Y6N_U_CryOsw6FTOdKL_jpQ-&CONTENTTYPE=image/jpeg'}]
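
A list of story dicts shaped like the one above can be written to CSV with the standard library. This is a rough sketch, not necessarily how the first CSV file was produced: the function name `write_stories_csv` and the column order are assumptions, and the sample record is a trimmed copy of the story above.

```python
import csv
import io

# Column order is an assumption; the keys match the sample record.
FIELDS = ['byline', 'category', 'catid', 'count', 'deck', 'headline',
          'path', 'published', 'server', 'author', 'excerpt', 'sspid', 'image']


def write_stories_csv(stories, fileobj):
    """Write story dicts to an open text file object, one row per story."""
    # Missing keys are written as empty cells; unknown keys are ignored.
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    for story in stories:
        writer.writerow(story)


stories = [{
    'byline': 'The Register-Guard Staff',
    'category': 'Sports',
    'catid': '32058784',
    'count': 1,
    'headline': 'The Bando bandwagon',
    'published': '2018-03-14 14:30:01',
    'sspid': '679',
}]

buf = io.StringIO()
write_stories_csv(stories, buf)
print(buf.getvalue())
```

To write to disk instead of a buffer, open the target with `open(path, 'w', newline='')` and pass that file object in.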

Example gallery field mapping for NCS

<?xml version="1.0" encoding="UTF-8" ?>
<gallery>
	<uniqueid></uniqueid>
	<title><![CDATA[]]></title>
	<date></date>
	<category>PHOTOGALLERY</category>
	<taxonomies></taxonomies>
	<description>GALLERY CAPTION</description>
	<images>
		<image>
			<title><![CDATA[]]></title>
			<caption><![CDATA[]]></caption>
			<credit><![CDATA[]]></credit>
			<filename></filename>
		</image>
	</images>
	<seo-label></seo-label>
</gallery>
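
A document with this shape could be generated with `xml.dom.minidom`, which can emit CDATA sections. The sketch below is an illustration under assumptions: the function name `build_gallery_xml`, its parameters, and the sample values are hypothetical, and the date format NCS expects is not known from these notes. Empty elements are serialized in the short form (for example `<taxonomies/>`).

```python
from xml.dom import minidom


def build_gallery_xml(uniqueid, title, date, description, images):
    """Build a gallery document following the field mapping above."""
    doc = minidom.Document()
    gallery = doc.createElement('gallery')
    doc.appendChild(gallery)

    def add(parent, tag, text='', cdata=False):
        # Append <tag> to parent, wrapping the text in CDATA when asked.
        el = doc.createElement(tag)
        if cdata:
            el.appendChild(doc.createCDATASection(text))
        elif text:
            el.appendChild(doc.createTextNode(text))
        parent.appendChild(el)
        return el

    add(gallery, 'uniqueid', uniqueid)
    add(gallery, 'title', title, cdata=True)
    add(gallery, 'date', date)
    add(gallery, 'category', 'PHOTOGALLERY')
    add(gallery, 'taxonomies')
    add(gallery, 'description', description)
    images_el = add(gallery, 'images')
    for img in images:
        image_el = add(images_el, 'image')
        add(image_el, 'title', img.get('title', ''), cdata=True)
        add(image_el, 'caption', img.get('caption', ''), cdata=True)
        add(image_el, 'credit', img.get('credit', ''), cdata=True)
        add(image_el, 'filename', img.get('filename', ''))
    add(gallery, 'seo-label')
    # toxml(encoding=...) returns bytes that include the XML declaration.
    return doc.toxml(encoding='UTF-8').decode('utf-8')


# Sample values are placeholders, not real gallery data.
xml_text = build_gallery_xml(
    uniqueid='679',
    title='The Bando bandwagon',
    date='2018-03-14',
    description='GALLERY CAPTION',
    images=[{'caption': 'Sample caption', 'credit': 'Sample credit',
             'filename': 'sample.jpg'}],
)
print(xml_text)
```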

Data coming out of SSP API:

  • image (990)
  • thumb (150)
  • byline (name only)
  • description (caption)
  • id (unique ssp id)
  • filename (original filename)
  • original (original image file)
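
The SSP fields line up fairly naturally with the `<image>` element in the NCS example. The sketch below shows one possible mapping; the pairing of `description` with caption and `byline` with credit is an assumption, as are the function name `ssp_to_ncs_image` and the sample record.

```python
def ssp_to_ncs_image(ssp_image):
    """Map one SSP image record to the fields of an NCS <image> element."""
    return {
        'title': '',  # SSP has no obvious title field; left empty here
        'caption': ssp_image.get('description', ''),
        'credit': ssp_image.get('byline', ''),
        'filename': ssp_image.get('filename', ''),
    }


# Hypothetical record; keys follow the SSP field list above.
sample = {
    'id': '1',
    'byline': 'Sample Photographer',
    'description': 'Sample caption text',
    'filename': 'sample.jpg',
}
print(ssp_to_ncs_image(sample))
```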

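This worked to download an image:

import shutil
import requests

# url is the address of the image to fetch
r = requests.get(url, stream=True, verify=False)  # verify=False skips TLS certificate checks
path = '/Users/rdenton/Desktop/test.jpg'
if r.status_code == 200:
    with open(path, 'wb') as f:
        # Decode gzip/deflate transfer encoding before copying the raw bytes
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)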
