Exploration of philosophical material in the New Zealand National Library Newspaper Open Data Pilot.

JoshuaWilsonBlack/NPOD_Philosophy

Philosophy in Early New Zealand Newspapers

The files in this repository are work towards an investigation of philosophical content, broadly understood, in early New Zealand newspaper writing, using the National Library of New Zealand's Papers Past Newspaper Open Data Pilot dataset (https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot).

The directories contain:

  • NPOD_Starter: the starter corpus from the National Library of New Zealand
  • classifiers: trained classification models (pickled)
  • dictionaries: dictionaries generated from various subsets of the corpus with gensim
  • lda_models: trained LDA topic models
  • pickles: pickles of various subsets of the dataset. Note: some pickled corpora are too large for GitHub.
  • presentation: LaTeX code for the project presentation
  • report: LaTeX code for the project report

The Jupyter notebooks have the following roles:

  • 'Classifying texts.ipynb': code used to assign categorical labels to articles.
  • 'Entity Extraction *.ipynb': application of spaCy to extract named entities and proper nouns from corpora.
  • 'NaiveBayes_PhilosoClassification*.ipynb': application of Naive Bayes classifiers trained on a labelled dataset and then applied to the corpus as a whole.
  • '*_exp.ipynb': use of collocation, cooccurrence, and concordancing to explore candidate corpora.
  • 'starter_topicmodels.ipynb': use of gensim topic modelling to explore the 'Starter kit' of the dataset.
  • 'Religion and Evolution in the REL corpus.ipynb': what the filename says.
  • 'NZ Content': looking for NZ-specific content in the NB2 corpus.
  • 'Relabelling.ipynb': Proposals to improve labelling, begun but not completed.
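The classification step described above (train on a labelled subset, then apply to the rest of the corpus) can be illustrated with a minimal sketch. This is not the code from the notebooks: it is a from-scratch multinomial Naive Bayes with add-one smoothing using only the standard library, and the function names, labels, and toy sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Fit a multinomial Naive Bayes model with add-one smoothing.

    labelled_docs: iterable of (list_of_tokens, label) pairs.
    """
    doc_counts = Counter()               # number of documents per label
    word_counts = defaultdict(Counter)   # token frequencies per label
    vocab = set()
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(doc_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in doc_counts.items()}
    return log_prior, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log posterior for the given tokens."""
    log_prior, word_counts, vocab = model
    scores = {}
    for label, prior in log_prior.items():
        # Denominator for add-one smoothing: label token count + vocabulary size
        total = sum(word_counts[label].values()) + len(vocab)
        score = prior
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / total)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy labelled data (invented for illustration, not drawn from the corpus)
labelled = [
    ("the nature of truth and knowledge".split(), "philosophy"),
    ("mind and matter in modern metaphysics".split(), "philosophy"),
    ("wool prices rose at the auction".split(), "other"),
    ("shipping news from the port of lyttelton".split(), "other"),
]
model = train_naive_bayes(labelled)
prediction = classify("a lecture on truth and metaphysics".split(), model)
print(prediction)
```

In practice a library classifier would likely be preferable; the sketch only shows the train-then-apply shape of the workflow.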

Various scripts are also included:

  • 'NL_helpers.py': a set of helper functions used in the notebooks above
  • 'NL_topicmodels.py': a corpus class for use with gensim and helpers specifically for the topic modelling side of the project.
  • 'generate_corpus_df.py': script to go from dataset stored in tarballs to a collection of pickled pandas dataframes.
  • 'keywords_from_corpus.py': a script to search for keywords in the complete corpus using dataframes generated by 'generate_corpus_df.py'
  • 'cooccurrence.py': a script to generate cooccurrence scores for given terms and store the results in a dataframe. This is particularly useful for the Dash app (in a separate GitHub repository).
  • 'add_cooccurrence_terms.py': a script to add terms to already generated cooccurrence dataframes.
  • 'generate_*.py': scripts to generate various useful outputs.
  • 'corpus2markdown.py': Takes a corpus and saves it as a series of Markdown files with links to Papers Past website.
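As a rough illustration of what a cooccurrence score for a given term can look like, here is a minimal sketch. The scoring used by 'cooccurrence.py' is not documented here, so this example assumes document-level pointwise mutual information (PMI) and returns a plain dictionary rather than a dataframe; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_scores(docs, term):
    """Document-level PMI between `term` and each token that co-occurs with it.

    docs: list of token lists, one per document.
    Returns {other_token: pmi}; empty if `term` never appears.
    """
    n_docs = len(docs)
    doc_freq = Counter()  # number of documents containing each token
    joint = Counter()     # number of documents containing both `term` and the token
    for tokens in docs:
        unique = set(tokens)
        doc_freq.update(unique)
        if term in unique:
            joint.update(unique - {term})
    scores = {}
    for other, both in joint.items():
        # PMI = log2( P(term, other) / (P(term) * P(other)) )
        scores[other] = math.log2(both * n_docs / (doc_freq[term] * doc_freq[other]))
    return scores


# Toy documents (invented for illustration)
docs = [
    "evolution and religion".split(),
    "evolution and science".split(),
    "religion and church".split(),
    "wool and prices".split(),
]
scores = cooccurrence_scores(docs, "evolution")
for token, pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{token}\t{pmi:.2f}")
```

A token that appears in every document, such as "and" here, scores zero, which is one reason PMI-style measures are often preferred over raw counts for this kind of exploration.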

This repository contains almost all of the code I have used in the course of the project, but it does not contain all of the data, which is too large for GitHub. Much of the code is in rough-and-ready script form and has not been tidied to the point that would be required for a complete recreation of the project.
