Usage: hayden <config file location>
That's pretty much it. As for the config file, it's simply a JSON file containing parameters and rules for Hayden to follow.
An example config can be found here: example-config-scraper.json
Config keys are case insensitive.
This contains configuration relating to an external website to scrape from.
Specifies the imageboard software of the website you're scraping from. Can be these types:
4chan
LynxChan
Vichan
InfinityNext
FoolFuuka
Meguca
PonyChan
ASPNetChan
The root URL of the imageboard you wish to scrape from. An example would be https://8chan.moe/
This is ignored and not required for the 4chan
type; it always scrapes from 4chan.org
An array of the boards you wish to scrape from the site; Hayden cannot automatically detect which boards exist on an imageboard.
It's keyed with the short name of the board you wish to scrape, and the value is an object containing more specific rules around the board.
All regexes are applied on an OR basis; any regex matching will cause the thread to be archived.
Available properties to apply to each board:
-
ThreadTitleRegex
A regex for filtering on the title/subject line of a thread. -
OPContentRegex
A regex for filtering on the opening post content of a thread. -
AnyFilter
A regex for filtering on the title/subject OR opening post content of a thread.
The amount of seconds Hayden should wait (at minimum) inbetween making API calls. (This is specifically per connection, including proxies). Can be a decimal number
The amount of seconds Hayden should wait at minimum before attempting to scrape the board thread listings again. If a single scrape run takes longer than this time, then the next board scrape will happen immediately. Can be a decimal number.
Specifies either true
or false
that Hayden should read the archives for each board on startup (only applicable to boards and imageboard software that support and have an archive). Obviously incurs a speed penalty for the initial scrape.
An array of proxies that can be used to allow parallelized scraping without incurring ban risk from polling too much from a single IP address.
Proxy objects look like this:
{
"url": "socks5://1.2.3.4:8000",
"username": "user",
"password": "pass"
}
May support HTTP proxies; I've only ever tested SOCKS5.
This contains configuration relating to where data should be saved to. See below for more detailed information about each consumer type.
The type of consumer that will be used to store data.
Filesystem
for flat-file JSON storageAsagi
for AsagiHayden
for Hayden's database format
The directory to where files should be downloaded to. The result directory structure is specific to each consumer type.
The type of database to connect to. Only applicable to Asagi
and Hayden
consumer types.
Can be these options:
MySQL
(this is also the case for MariaDB; Hayden will automatically determine if it's MySQL or MariaDB)Sqlite
The connection string to use to connect to the database. Format is specific to the database type; check the sample config for examples
Specifies if full-size images should be downloaded.
Specifies if thumbnail images should be downloaded. As of writing, Hayden
consumer type will not allow only downloading thumbnails without downloading full images.
There are currently 3 supported data stores:
-
Hayden
- A custom database schema intended to support exotic data structures from altchans (such as multiple image support) while also retaining support for 4chan. This is currently the only datastore that the Hayden webserver can read from. Should be considered stable, however there will likely be some schema changes / improvements in the future that will be forwards compatible.
-
Asagi
- The database schema used by Asagi and FoolFuuka. This may or may not be fully 100% functional anymore; it's been a long time since I've tested it but from the lack of issues I've heard about it, it seems to be working just fine. This uses the extended-length schema as of 2022 that supports the newer 4chan
tim
fields, and as such may have issues with older versions that don't have the same changes. While Asagi theoretically supports PostgreSQL, this only supports MySQL/MariaDB.
- The database schema used by Asagi and FoolFuuka. This may or may not be fully 100% functional anymore; it's been a long time since I've tested it but from the lack of issues I've heard about it, it seems to be working just fine. This uses the extended-length schema as of 2022 that supports the newer 4chan
-
JSON flat file
- Similar to what you would recieve when running something like gallery-dl. Creates a folder for each thread, and in it writes a metadata JSON file and each image (+ thumbnail).
This is different to just writing the returned API JSON document, as it does not keep track of deleted / modified posts. Hayden instead writes a custom format that it uses internally, with the original JSON document as a property.
- Similar to what you would recieve when running something like gallery-dl. Creates a folder for each thread, and in it writes a metadata JSON file and each image (+ thumbnail).
A table of which API frontends support which backends:
Yotsuba | LynxChan | Vichan/Infinity | InfinityNext | Meguca | FoolFuuka | Ponychan | ASPNetChan | |
---|---|---|---|---|---|---|---|---|
Asagi | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Filesystem | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Hayden | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
For Hayden and Asagi data stores, they require a database to be able to operate. Hayden will create the tables for you, but not the database. If required, a specific database collation will be listed in the table below.
Supported DBMS' for each store:
MySQL 8.x | MariaDB | SQLite | PostgreSQL | |
---|---|---|---|---|
Asagi | utf8mb4_general_ci ) [1] |
✅ (utf8mb4_general_ci ) |
❌ | ❌ |
Hayden | ✅ (utf8mb4_0900_ai_ci ) |
utf8mb4_general_ci ) [2] |
✅ | ❌ |
[1]: Asagi databases are scaffolded using a schema designed for MariaDB. This may or may not work for MySQL; YMMV and you might have to modify it a bit. Let me know how you go
[2]: Hayden datastore was designed for MySQL and may not work for MariaDB. utf8mb4_general_ci
is listed as a generic recommendation and is untested; Hayden does not perform any text searches and as such utf8mb4_bin
may be more suitable. You'll have to figure out what works best for you
Here's an example excerpt
[19/10/2021 3:47:34 AM] 4 threads have been queued total
[19/10/2021 3:47:34 AM] [Thread] /vg/11111 +(2/4) [2/1/4]
[19/10/2021 3:47:35 AM] [Image] [2/0]
[19/10/2021 3:47:35 AM] [Thread] /trash/2222 +(2/1) [2/2/4]
[19/10/2021 3:47:35 AM] [Image] [4/0]
[19/10/2021 3:47:35 AM] [Thread] /tg/333333 +(0/1) [0/3/4]
[19/10/2021 3:47:36 AM] [Thread] /tg/444444 N +(0/0) [0/4/4]
[19/10/2021 3:47:36 AM]
Hayden will periodically poll each board and determine which threads need to be re-polled and enqueue them.
Each [Thread]
line can be read as such:
[Thread] /board/00000 +(1/2) [3/4/5]
00000: The thread ID
1: The amount of new images to download from this thread
2: The count of new posts in the thread, subtracted by the count of deleted posts
3: The total amount of images that are currently queued for downloading
4: The amount of threads that have been polled
5: The total amount of threads that need to be polled
Sometimes they have a letter before the +
symbol. This indicates the status of the thread:
A
: Thread has been archived (and will no longer be polled)D
: Thread has been deleted / pruned on an archive-less board (and will no longer be polled)N
: Thread has not changed since the last time it was polled, and the returned data will be ignoredS
: Thread has been skipped from archival (and blacklisted) because it did not satisfy the combination of filters for the boardE
: Hayden has encountered an internal error attempting to process this thread, and as such will retry it on the next board scrape loop
Likewise for the [Image]
lines:
[Image] [1/2]
1: Amount of images yet to be downloaded
2: The total amount of images downloaded during this board scrape cycle