Kenneth Gitere
5f99bddc10
Add custom serializer for XHTML
2020-11-24 14:54:23 +03:00
Kenneth Gitere
ef3efdba81
Refactor to use temp directory and update surf
...
Change from using res directory for image downloads to using temp directories.
Update surf to v2, which required changing how Content-Type headers are
read.
2020-11-23 13:38:58 +03:00
Kenneth Gitere
db11e78d8c
Add template for epub output
...
Change output format to name the file after the title
Add getters in MetaData
2020-10-22 13:55:02 +03:00
Kenneth Gitere
703de7e3bf
Merge the readability module with the rest of the extractor
2020-10-22 12:12:30 +03:00
Kenneth Gitere
aacb442b7a
Move MetaAttr to moz_readability
and rename to MetaData
...
Add get_article_metadata, get_article_title and unescape_html_entities
and their tests
2020-10-20 22:27:40 +03:00
Kenneth Gitere
6dab011cac
Fixed img resolving bug
2020-05-16 10:22:49 +03:00
Kenneth Gitere
c30d5f732e
Fix test data
2020-05-06 14:01:49 +03:00
Kenneth Gitere
271d3c8951
Change download code to save images to a folder
...
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
f02973157d
Refactor downloading code to download images in parallel
2020-05-05 09:40:44 +03:00
Kenneth Gitere
e5a318282d
Update img tags with new src values to point to the local files
2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a
Add image download functionality
2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f
Change signature of extract_content
to copy the reference to the article DOM node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227
Add test for extract content
2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d
Factor out text extraction into extractor module
2020-05-01 16:17:59 +03:00