Commit graph

17 commits

Author SHA1 Message Date
Kenneth Gitere
ef3efdba81 Refactor to use temp directory and update surf
Change from using res directory for image downloads to using temp directories.
Update surf to v2, which required changing how Content-Type headers are
read.
2020-11-23 13:38:58 +03:00
Kenneth Gitere
ab800d0174 Bug fix and add printing of the name of the extracted EPUB
The fix prevents creating the res directory if it already exists
2020-11-23 09:06:13 +03:00
Kenneth Gitere
be48cc1e47 Fix alignment in README
Update manifest file
Fix the serialized file to use self-closing tags, since the previous
output was invalid XHTML
2020-10-22 19:18:18 +03:00
Kenneth Gitere
1b4c4ee658 Change CLI option to allow for multiple arguments
Add basic looping in async runtime
2020-10-22 15:22:56 +03:00
Kenneth Gitere
db11e78d8c Add template for epub output
Change output format to name file with the title name
Add getters in MetaData
2020-10-22 13:55:02 +03:00
Kenneth Gitere
703de7e3bf Merge the readability module with the rest of the extractor 2020-10-22 12:12:30 +03:00
Kenneth Gitere
75018894ae Add regexes module in moz_readability that contains the regular
expressions used. For optimal performance, the regular expressions
are compiled to static values to prevent recompiling in loops
2020-10-15 22:25:10 +03:00
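The compile-once idea in commit 75018894ae can be sketched roughly as below. This is a minimal, std-only illustration and not the code from the commit: `Pattern`, `byline_pattern`, `count_bylines`, and `compile_count` are hypothetical names, and `Pattern` is only a stand-in for a real compiled regular expression type. The sketch assumes Rust 1.70 or later for `std::sync::OnceLock`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the stand-in compile step has run.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled regular expression.
pub struct Pattern {
    needle: String,
}

impl Pattern {
    /// Stands in for an expensive compile step.
    fn compile(source: &str) -> Pattern {
        COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
        Pattern {
            needle: source.to_lowercase(),
        }
    }

    /// Case-insensitive substring check, standing in for a regex match.
    pub fn is_match(&self, haystack: &str) -> bool {
        haystack.to_lowercase().contains(&self.needle)
    }
}

/// Returns the shared pattern, compiling it on first use only.
pub fn byline_pattern() -> &'static Pattern {
    static BYLINE: OnceLock<Pattern> = OnceLock::new();
    BYLINE.get_or_init(|| Pattern::compile("byline"))
}

/// Calls the accessor inside a loop; the compile step still runs once.
pub fn count_bylines(class_names: &[&str]) -> usize {
    let mut hits = 0;
    for name in class_names {
        if byline_pattern().is_match(name) {
            hits += 1;
        }
    }
    hits
}

/// Reports how many times `Pattern::compile` has been called.
pub fn compile_count() -> usize {
    COMPILE_COUNT.load(Ordering::SeqCst)
}
```

Because the `OnceLock` lives in a `static`, every later call to `byline_pattern()` returns the already-built value, so the loop in `count_bylines` never repeats the compile step.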
Kenneth Gitere
e1debf5630 Add moz_readability initial code and accompanying unit tests
This currently contains the preprocessing code of the Readability.
It is a port of Readability.js by Mozilla.
2020-08-31 19:30:09 +03:00
Kenneth Gitere
9f56c58dd9 Add simple CLI wrapper 2020-05-16 10:09:44 +03:00
Kenneth Gitere
271d3c8951 Change download code to save images to a folder
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
4e8812c1ee Add first attempt to save an epub file 2020-05-02 19:25:31 +03:00
Kenneth Gitere
e5a318282d Update img tags with new src values to point to the local files 2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a Add image download functionality 2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f Change signature of extract_content to copy the reference to article DOM
node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227 Add test for extract content 2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d Factor out text extraction into extractor module 2020-05-01 16:17:59 +03:00
Kenneth Gitere
4527fb07d9 Initial extraction code to get meta information on a blog 2020-04-30 11:05:53 +03:00