Kenneth Gitere
21c3ffd922
Refactor fetch_url
...
This adds:
- More validation of responses to ensure the HTML response is valid.
- Better handling of redirecting URLs, which allows fetching
links proxied to Medium.
2021-01-24 17:52:31 +03:00
Kenneth Gitere
1dc7b3432b
Bug fixes
...
The bug fixes include:
- `<html>` nodes being added to the replaced image when `unwrap_noscript_tags`
is called.
- Remove the `srcset` attribute of `<img>` tags after downloading the image. The
attribute prevented readers like Foliate from displaying the downloaded image
2021-01-12 10:27:46 +03:00
Kenneth Gitere
8407c613df
Bug fixes
...
- Prevent downloading images with base64 strings as the source
- Add escaping of quotation characters in the serializer
- Disable redirects when downloading images, since following them fails on multiple sites
- Remove invalid characters for making the epub export file name
- Fix version number in release
2020-12-24 14:03:36 +03:00
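A minimal sketch of two of the fixes listed in 8407c613df, using only the standard library. The function names and the exact set of rejected characters are assumptions for illustration; the commit does not show how the project implements either check.

```rust
/// Returns true when an <img> `src` is an inline `data:` URI (for example a
/// base64-encoded image), so there is nothing to download.
/// Hypothetical helper, not taken from the project.
fn is_data_uri(src: &str) -> bool {
    src.trim_start().to_ascii_lowercase().starts_with("data:")
}

/// Replaces characters that are commonly rejected in file names with a
/// space, then collapses runs of whitespace into a single space.
/// The character list is an assumption, not the project's list.
fn sanitize_file_name(title: &str) -> String {
    const INVALID: &[char] = &['/', '\\', ':', '*', '?', '"', '<', '>', '|'];
    let replaced: String = title
        .chars()
        .map(|c| if INVALID.contains(&c) || c.is_control() { ' ' } else { c })
        .collect();
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With this sketch, a title such as `What is 1/2? A: "half"` would become `What is 1 2 A half` before the `.epub` extension is appended.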
Kenneth Gitere
3bfa82ba60
Update README and version
2020-11-24 18:39:51 +03:00
Kenneth Gitere
725c73c83f
Add basic redirect provided by surf and early exit of the program if the response is not a 200
2020-11-24 18:31:16 +03:00
Kenneth Gitere
5f99bddc10
Add custom serializer for XHTML
2020-11-24 14:54:23 +03:00
Kenneth Gitere
37cb4e1fd2
Change from structopt to clap
...
This allows printing the help message if no args are passed
2020-11-24 09:58:50 +03:00
Kenneth Gitere
cdfbc2b3f6
Refactor inline_css_str_to_map to use a better tokenizer
2020-11-24 08:29:00 +03:00
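The kind of tokenizing that cdfbc2b3f6 refers to can be sketched as follows, assuming the goal is to split a `style` attribute on `;` without breaking values that contain `;` inside quotes or parentheses. The function below is a hypothetical stand-in written for illustration, not the project's `inline_css_str_to_map`, and it does not handle escaped quotes inside strings.

```rust
use std::collections::HashMap;

/// Parses the value of a `style` attribute into a property -> value map.
/// Hypothetical stand-in for illustration only.
fn inline_css_to_map(css: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    let mut decl = String::new();
    let mut quote: Option<char> = None; // Some(q) while inside '...' or "..."
    let mut depth = 0usize; // > 0 while inside url(...) or similar

    // A trailing ';' sentinel flushes the last declaration.
    for c in css.chars().chain(std::iter::once(';')) {
        match (quote, c) {
            // Closing quote: leave string mode.
            (Some(q), _) if c == q => {
                quote = None;
                decl.push(c);
            }
            // Any other character inside a string is kept verbatim.
            (Some(_), _) => decl.push(c),
            (None, '"') | (None, '\'') => {
                quote = Some(c);
                decl.push(c);
            }
            (None, '(') => {
                depth += 1;
                decl.push(c);
            }
            (None, ')') => {
                depth = depth.saturating_sub(1);
                decl.push(c);
            }
            // A top-level ';' ends the current declaration.
            (None, ';') if depth == 0 => {
                if let Some((prop, value)) = decl.split_once(':') {
                    let (prop, value) = (prop.trim(), value.trim());
                    if !prop.is_empty() && !value.is_empty() {
                        map.insert(prop.to_ascii_lowercase(), value.to_string());
                    }
                }
                decl.clear();
            }
            (None, _) => decl.push(c),
        }
    }
    map
}
```

A naive `split(';')` would cut `url(data:image/png;base64,...)` in half; tracking quote and parenthesis state is what avoids that in this sketch.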
Kenneth Gitere
aff4054ca9
Update crates and fix bugs
...
The bug fixes are for:
- <base> elements with "/" as the href
- articles containing an ampersand in the title which would create
corrupted manifest files.
2020-11-23 15:55:58 +03:00
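A bare `&` in a title corrupts an XML manifest because the parser expects an entity to follow it. A minimal sketch of the usual remedy, assuming the title is escaped before being written; `escape_xml_text` is a hypothetical name and the commit does not show the project's actual fix.

```rust
/// Escapes the five characters that XML reserves in text and attribute
/// values. Hypothetical helper for illustration only.
fn escape_xml_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}
```

The input must be unescaped text; running the function on an already escaped string would escape the `&` of each entity a second time.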
Kenneth Gitere
ef3efdba81
Refactor to use temp directory and update surf
...
Change from using res directory for image downloads to using temp directories.
Update surf to v2, which required changing how Content-Type headers are
read.
2020-11-23 13:38:58 +03:00
Kenneth Gitere
ab800d0174
Bug fix and add printing of the name of the extracted EPUB
...
The fix prevents creating the res directory if it already exists
2020-11-23 09:06:13 +03:00
Kenneth Gitere
b0e402d685
Resize logo
2020-10-24 08:20:47 +03:00
Kenneth Gitere
566c3427be
Merge pull request #1 from hipstermojo/readability
...
Add Readability port
2020-10-22 19:24:31 +03:00
Kenneth Gitere
be48cc1e47
Fix alignment in README
...
Update manifest file
Add fix in serialized file to emit self-closing tags, since unclosed tags
are invalid XHTML
2020-10-22 19:18:18 +03:00
Kenneth Gitere
6aef1631e3
Add README
2020-10-22 16:03:57 +03:00
Kenneth Gitere
1b4c4ee658
Change CLI option to allow for multiple arguments
...
Add basic looping in async runtime
2020-10-22 15:22:56 +03:00
Kenneth Gitere
db11e78d8c
Add template for epub output
...
Change output format to name file with the title name
Add getters in MetaData
2020-10-22 13:55:02 +03:00
Kenneth Gitere
703de7e3bf
Merge the readability module with the rest of the extractor
2020-10-22 12:12:30 +03:00
Kenneth Gitere
679bf3cb04
Add logic for attempting different rounds for content extraction
...
with different flags set
Add additional test in `fix_relative_uris`
2020-10-22 11:50:34 +03:00
Kenneth Gitere
a0f69ccf80
Fix bug in is_probably_visible
...
Add fix in `grab_article` when appending nodes. Appending internally
detaches children, so the loop can end up running only once
2020-10-22 11:37:02 +03:00
Kenneth Gitere
a94798cc95
Add flags for conditional cleaning and removal of nodes
...
This also includes updating the function signatures of the affected
methods
2020-10-22 08:24:46 +03:00
Kenneth Gitere
f17c9bfbc9
Add bug fixes for overflows in subtraction, giving a default for
...
capture groups and in extracting nodes. Add fix in `is_probably_visible`
2020-10-21 20:48:21 +03:00
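By default, subtracting below zero on an unsigned integer in Rust panics in debug builds and wraps around in release builds. A minimal sketch of two standard-library ways to guard such a subtraction; the function names are hypothetical and the commit does not say which approach f17c9bfbc9 uses.

```rust
/// Clamps the result at zero instead of overflowing.
fn remaining(limit: usize, used: usize) -> usize {
    limit.saturating_sub(used)
}

/// Returns `None` when the subtraction would go below zero, leaving the
/// decision to the caller.
fn checked_remaining(limit: usize, used: usize) -> Option<usize> {
    limit.checked_sub(used)
}
```

`saturating_sub` suits counters where zero is a sensible floor; `checked_sub` suits cases where going below zero signals a logic error worth handling explicitly.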
Kenneth Gitere
350447d1c4
Change calls on replacing regexes to replace_all
...
Add `fix_relative_uris`, `clean_classes`, `clean_readability_attrs`
and `post_process_content`
2020-10-21 19:55:22 +03:00
Kenneth Gitere
aacb442b7a
Move MetaAttr to moz_readability and rename to MetaData
...
Add get_article_metadata, get_article_title and unescape_html_entities
and their tests
2020-10-20 22:27:40 +03:00
Kenneth Gitere
d99b1c687b
Fix counting of h2 nodes in prep_article
...
Add test for prep_article
2020-10-20 10:13:34 +03:00
Kenneth Gitere
94fa8db218
Fix bug in deletion of multiple nodes.
...
When calling `detach` in a for loop or `for_each` iterator consumer,
only the first node is ever deleted.
Fix replacement of table nodes in prep_article
Edit clean_conditionally to remove unnecessary assignment.
2020-10-20 10:04:12 +03:00
Kenneth Gitere
ccdbbb5a16
Add initial implementation of grabArticle
...
Change function signature of setNodeTag to return a NodeRef
Minor fix in clean, clean_headers and clean_conditionally
2020-10-20 07:42:32 +03:00
Kenneth Gitere
3254064c0d
Fix calls to select to return an iterator excluding the original
...
calling node.
Edit `next_element` to either return an element node only or element/
text node
2020-10-17 07:13:39 +03:00
Kenneth Gitere
6377c01fb3
Add tests for clean_conditionally and fix_lazy_images
...
Minor refactor in `fix_lazy_images`
Fix incorrect boolean expression and bug in element node name comparison
in `clean_conditionally`
2020-10-16 08:03:01 +03:00
Kenneth Gitere
78d6e16618
Add unit tests for clean, clean_styles, clean_headers and
...
`clean_matched_nodes`
Add missing function calls in `prep_article`
2020-10-16 08:00:47 +03:00
Kenneth Gitere
b661211f0f
Refactored code to use regexes from regexes module
...
Extracted constants from the code for easier reusability in some cases.
Change select queries for multiple elements to use the `,` operator
instead of calling `chain`.
Remove check for "null" in `fix_lazy_images`. The check works around a JSDOM
issue, so it is not needed in the Rust code.
2020-10-15 22:45:18 +03:00
Kenneth Gitere
75018894ae
Add regexes module in moz_readability that contains the regular
...
expressions used. For optimal performance, the regular expressions
are compiled to static values to prevent recompiling in loops
2020-10-15 22:25:10 +03:00
Kenneth Gitere
d2bd31dc47
Add helper functions for the grabArticle function
2020-10-07 20:46:08 +03:00
Kenneth Gitere
87ff21b676
Add regex and lazy_static crates
2020-10-07 20:44:35 +03:00
Kenneth Gitere
7219198524
Change function signature of next_element to return an Option
...
rather than mutate a given value.
The new function signature reads a little easier than before.
Remove TODO task in replace_brs
2020-09-23 22:52:07 +03:00
Kenneth Gitere
7fb09130e8
Add calls to remove_scripts and prep_document
2020-08-31 20:40:37 +03:00
Kenneth Gitere
e1debf5630
Add moz_readability initial code and accompanying unit tests
...
This currently contains the preprocessing code of Readability.
It is a port of Readability.js by Mozilla.
2020-08-31 19:30:09 +03:00
Kenneth Gitere
a27e45b5f3
Merge branch 'master' into dev
2020-05-16 10:35:47 +03:00
Kenneth Gitere
5e7cf7ddfe
Fixed img resolving bug
2020-05-16 10:32:36 +03:00
Kenneth Gitere
6dab011cac
Fixed img resolving bug
2020-05-16 10:22:49 +03:00
Kenneth Gitere
9f56c58dd9
Add simple CLI wrapper
2020-05-16 10:09:44 +03:00
Kenneth Gitere
c30d5f732e
Fix test data
2020-05-06 14:01:49 +03:00
Kenneth Gitere
271d3c8951
Change download code to save images to a folder
...
Add downloaded images to the output epub file
2020-05-05 12:24:11 +03:00
Kenneth Gitere
f02973157d
Refactor downloading code to download images in parallel
2020-05-05 09:40:44 +03:00
Kenneth Gitere
4e8812c1ee
Add first attempt to save an epub file
2020-05-02 19:25:31 +03:00
Kenneth Gitere
e5a318282d
Update img tags with new src values to point to the local files
2020-05-02 19:06:03 +03:00
Kenneth Gitere
78ba40f57a
Add image download functionality
2020-05-02 18:33:45 +03:00
Kenneth Gitere
f24e72e70f
Change signature of extract_content to copy the reference to article DOM
...
node instead of writing to file
2020-05-02 14:51:53 +03:00
Kenneth Gitere
529704d227
Add test for extract content
2020-05-01 20:42:41 +03:00
Kenneth Gitere
b5336e078d
Factor out text extraction into extractor module
2020-05-01 16:17:59 +03:00