Kenneth Gitere
b661211f0f
Refactored code to use regexes from regexes module
Extracted constants from the code for easier reusability in some cases.
Change select queries for multiple elements to use the `,` operator
instead of calling `chain`.
Remove check for "null" in `fix_lazy_images`. This check mitigates a JSDOM
issue, so removing it doesn't affect the Rust code in any way.
2020-10-15 22:45:18 +03:00
Kenneth Gitere
75018894ae
Add regexes module in moz_readability that contains the regular
expressions used. For optimal performance, the regular expressions
are compiled to static values to prevent recompiling in loops
2020-10-15 22:25:10 +03:00
Kenneth Gitere
d2bd31dc47
Add helper functions for the grabArticle function
2020-10-07 20:46:08 +03:00
Kenneth Gitere
7219198524
Change function signature of next_element to return an Option
rather than mutate a given value.
The new function signature reads a little easier than before.
Remove TODO task in replace_brs
2020-09-23 22:52:07 +03:00
Kenneth Gitere
7fb09130e8
Add calls to remove_scripts and prep_document
2020-08-31 20:40:37 +03:00
Kenneth Gitere
e1debf5630
Add moz_readability initial code and accompanying unit tests
This currently contains only the preprocessing code of Readability.
It is a port of Readability.js by Mozilla.
2020-08-31 19:30:09 +03:00
Kenneth Gitere
6dab011cac
Fixed img resolving bug
2020-05-16 10:22:49 +03:00
Kenneth Gitere
9f56c58dd9
Add simple CLI wrapper
2020-05-16 10:09:44 +03:00
Kenneth Gitere
c30d5f732e
Fix test data
2020-05-06 14:01:49 +03:00
Kenneth Gitere
271d3c8951
Change download code to save images to a folder
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
f02973157d
Refactor downloading code to download images in parallel
2020-05-05 09:40:44 +03:00
Kenneth Gitere
4e8812c1ee
Add first attempt to save an epub file
2020-05-02 19:25:31 +03:00
Kenneth Gitere
e5a318282d
Update img tags with new src values to point to the local files
2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a
Add image download functionality
2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f
Change signature of extract_content to copy the reference to the
article DOM node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227
Add test for extract content
2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d
Factor out text extraction into extractor module
2020-05-01 16:17:59 +03:00
Kenneth Gitere
4527fb07d9
Initial extraction code to get meta information on a blog
2020-04-30 11:05:53 +03:00