Commit graph

121 commits

Author SHA1 Message Date
Kenneth Gitere
75018894ae Add regexes module in moz_readability that contains the regular
expressions used. For optimal performance, the regular expressions
are compiled to static values to prevent recompiling in loops
2020-10-15 22:25:10 +03:00
Kenneth Gitere
d2bd31dc47 Add helper functions for the grabArticle function 2020-10-07 20:46:08 +03:00
Kenneth Gitere
87ff21b676 Add regex and lazy_static crates 2020-10-07 20:44:35 +03:00
Kenneth Gitere
7219198524 Change function signature of next_element to return an Option
rather than mutate a given value.

The new function signature reads a little easier than before.
Remove TODO task in replace_brs
2020-09-23 22:52:07 +03:00
Kenneth Gitere
7fb09130e8 Add calls to remove_scripts and prep_document 2020-08-31 20:40:37 +03:00
Kenneth Gitere
e1debf5630 Add moz_readability initial code and accompanying unit tests
This currently contains the preprocessing code of Readability.
It is a port of Readability.js by Mozilla.
2020-08-31 19:30:09 +03:00
Kenneth Gitere
a27e45b5f3 Merge branch 'master' into dev 2020-05-16 10:35:47 +03:00
Kenneth Gitere
5e7cf7ddfe Fixed img resolving bug 2020-05-16 10:32:36 +03:00
Kenneth Gitere
6dab011cac Fixed img resolving bug 2020-05-16 10:22:49 +03:00
Kenneth Gitere
9f56c58dd9 Add simple CLI wrapper 2020-05-16 10:09:44 +03:00
Kenneth Gitere
c30d5f732e Fix test data 2020-05-06 14:01:49 +03:00
Kenneth Gitere
271d3c8951 Change download code to save images to a folder
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
f02973157d Refactor downloading code to download images in parallel 2020-05-05 09:40:44 +03:00
Kenneth Gitere
4e8812c1ee Add first attempt to save an epub file 2020-05-02 19:25:31 +03:00
Kenneth Gitere
e5a318282d Update img tags with new src values to point to the local files 2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a Add image download functionality 2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f Change signature of extract_content to copy the reference to article DOM
node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227 Add test for extract content 2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d Factor out text extraction into extractor module 2020-05-01 16:17:59 +03:00
Kenneth Gitere
4527fb07d9 Initial extraction code to get meta information on a blog 2020-04-30 11:05:53 +03:00
Kenneth Gitere
52f272f586 Initial commit 2020-04-30 08:06:07 +03:00