Extraction:
- fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
- pagetype and image urls added to metadata by @andremacola (#282, #310)
- add as_dict method to Document class with @edkrueger in #306
- XML output fix with @knit-bee in #315
- various smaller fixes: lists (#309), XPaths, metadata hardening
Navigation:
- transfer URL management to courlan.UrlStore (#232, #312)
- fixes for spider module
Maintenance:
- simplify code and extend tests
- update underlying packages htmldate and courlan, setup, and docs
Extraction:
- XML output improvements with @knit-bee (#273, #274)
- extraction bugs fixed (#263, #266), more robust HTML doctype parsing
- adjust thresholds for link density in paragraphs
Metadata:
- improved title and sitename detection (#284)
- faster author, categories, domain name, and tags extraction
- fixes to author emoji regexes by @felipehertzer (#269)
Command-line interface:
- review argument consistency and add deprecation warnings (#261)
Setup:
- make download timeout configurable (#263)
- updated dependencies, use of faust-cchardet for Python 3.11
Impact on extraction and output format:
- better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
- XML: preserve list type as attribute (#229)
- XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
- faster text cleaning and shorter code (#237 with @deedy5, #245)
- metadata: add language when detector is activated (#224)
- metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
- TXT: change markdown formatting of headers by @LaundroMat (#257)
Smaller changes in convenience functions:
- add function to clear caches (#219)
- CLI: change exit code if download fails (#223)
- settings: use "\n" for multiple user agents by @k-sareen (#241)
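The newline-separated user-agent setting above can be illustrated with a minimal sketch. It uses only the standard-library configparser. The USER_AGENTS key in a DEFAULT section, the sample agent strings, and the helper name parse_user_agents are assumptions made for illustration. This is not the library's actual parsing code.

```python
import configparser

# Hypothetical settings fragment: one user agent per line,
# written as indented continuation lines of a single value.
SAMPLE_CFG = """
[DEFAULT]
USER_AGENTS =
    ExampleBot/1.0 (+https://example.org/bot)
    Mozilla/5.0 (X11; Linux x86_64)
"""


def parse_user_agents(cfg_text):
    """Return the list of user agents found in the config text.

    The raw value is split on "\n"; blank lines are dropped.
    """
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    raw = config.get("DEFAULT", "USER_AGENTS", fallback="").strip()
    return [line.strip() for line in raw.split("\n") if line.strip()]


agents = parse_user_agents(SAMPLE_CFG)
for agent in agents:
    print(agent)
```

Splitting on line breaks rather than on commas or semicolons keeps agent strings intact, since those characters commonly occur inside user-agent values.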
Updates:
- docs updated (including #244 by @dsgibbons)
- package dependencies updated
- fast and robust html2txt() function added (#221)
- more robust parsing (#228)
- fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
- extraction about 10-20% faster, slightly better recall
- partial fixes for memory leaks (#216)
- docs extended and updated (#217, #225)
- prepared deprecation of old process_record() function
- more stable processing with updated dependencies
- more efficient rules for extraction
- metadata: further attributes used (with @felipehertzer)
- better baseline extraction
- issues fixed: #202, #204, #205
- evaluation updated
- --precision and --recall arguments added to the CLI
- better text cleaning: paywalls and comments
- improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
- further bugs fixed: #189, #192 (with @felipehertzer), #200
- efficiency: faster module loading and improved RAM footprint
- efficiency: replaced module readability-lxml by trimmed fork
- bugs fixed (#179, #180, #183, #184)
- improved baseline extraction
- cleaner metadata (with @felipehertzer)
- encodings: better detection, output NFC-normalized Unicode
- maintenance and performance: more efficient code
- bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
- prepare compatibility with upcoming Python 3.11
- changed default settings
- extended documentation
- compress HTML backup files & seamlessly open .gz files
- support JSON web feeds
- graphical user interface integrated into main package
- faster downloads: reviewed backoff, compressed data
- optional modules: downloads with pycurl, language identification with py3langid
- bugs fixed (#111, #125, #132, #136, #140)
- minor optimizations and fixes by @vbarbaresi in #124 & #130
- fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
- code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
- drop support for Python 3.5
- better, faster encoding detection: replaced chardet with charset_normalizer
- faster execution: updated justext to 3.0
- better extraction of sub-elements in tables (#78, #90)
- more robust web feed parsing
- further defined precision- and recall-oriented settings
- license extraction in footers (#118)
- first precision- and recall-oriented presets defined
- improvements in authorship extraction (thanks @felipehertzer)
- requesting TXT output with formatting now results in Markdown format
- bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
- setting for cookies in request headers (thanks @muellermartin)
- better date extraction thanks to htmldate update
- improved author extraction (thanks @felipehertzer!)
- bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
- docs updated and extended
- CLI: option names normalized (heed deprecation warnings), new option explore
- focused crawling functions including politeness rules
- more efficient multi-threaded downloads + use as Python functions
- documentation extended
- bugs fixed: extraction and URL handling
- removed support for Python 3.4
- better handling of formatting, links and images, title type as attribute in XML formats
- more robust sitemaps and feeds processing
- more accurate extraction
- further consolidation: code simplified and bugs fixed
- extraction trade-off: slightly better recall
- code robustness: requests, configuration and navigation
- bugfixes: image data extraction
- improved link discovery and handling
- fixes in metadata extraction, feeds and sitemaps processing
- breaking change: the extract function now reads target format from output_format argument only
- new extraction option: preserve links, CLI options re-ordered
- more opportunistic backup extraction
- customizable configuration file to parametrize extraction and downloads
- better handling of feeds and sitemaps
- additional CLI options: cryptographic hash for file name, use Internet Archive as backup
- more precise extraction
- faster downloads: requests replaced with bare urllib3 and custom decoding
- consolidation: bug fixes and improvements, many thanks to the issue reporters!
- added bare_extraction function returning Python variables
- improved link discovery in feeds and sitemaps
- option to preserve image info
- fixes (many thanks to bug reporters!)
- link discovery in sitemaps
- compatibility with Python 3.9
- extraction coverage improved
- deduplication now optional
- bug fixes
- optional language detector changed: langid → pycld3
- helper function bare_extraction()
- optional deduplication off by default
- better URL handling (courlan), more complete metadata
- code consolidation (cleaner and shorter)
- extended and more convenient command-line options
- output in JSON format
- bug fixes
- faster and more robust text and metadata extraction
- more efficient batch processing (parallel processing, URL queues)
- extraction and processing of ATOM/RSS feeds
- complete command-line tool with corresponding options
- better metadata extraction and integration (XML & XML-TEI)
- more efficient processing
- output directory as CLI-option
- improved "fast" mode (accuracy and speed)
- better fallbacks with readability-lxml and justext
- metadata extraction added
- more robust processing (tests, encoding handling)
- support for Python 3.4 reactivated
- bugs in XML output and discarding sections solved
- new tests and documentation
- code base re-structured for clarity and readability
- streamlined HTML processing and conversion
- internal least-recently-used (LRU) cache for deduplication
- export as CSV
- better test coverage, extraction recall and precision
- further documentation (trafilatura.readthedocs.org)
- optional processing of text formatting
- more complete settings file
- added metadata to the XML output
- production of valid XML TEI for simple documents
- better handling of nested elements, quotes and tables
- validation of XML TEI documents
- bulk download and processing
- handling of line breaks
- element trimming simplified
- first release used in production and meant to be archived for reproducibility and citability
- better extraction precision
- optional dependencies
- bugs in parsing removed
- code profiling and speed-up
- tables included in extraction
- bypass justext in arguments
- better handling of non-p elements
- better handling of text nodes
- improvements in extraction recall
- first release, minimum viable package