Commit graph

18 commits

Author SHA1 Message Date
Kenneth Gitere
679bf3cb04 Add logic for attempting different rounds for content extraction
with different flags set

Add additional test in `fix_relative_uris`
2020-10-22 11:50:34 +03:00
Kenneth Gitere
a0f69ccf80 Fix bug in is_probably_visible
Add fix in `grab_article` when appending nodes. Appending internally
detaches the children, so the loop can end up running only once
2020-10-22 11:37:02 +03:00
Kenneth Gitere
a94798cc95 Add flags for conditional cleaning and removal of nodes
This also includes updating the function signatures of the affected
methods
2020-10-22 08:24:46 +03:00
Kenneth Gitere
f17c9bfbc9 Add bug fixes for overflows in subtraction, for giving a default
for capture groups, and for extracting nodes. Add fix in `is_probably_visible`
2020-10-21 20:48:21 +03:00
Kenneth Gitere
350447d1c4 Change calls on replacing regexes to replace_all
Add `fix_relative_uris`, `clean_classes`, `clean_readability_attrs`
and `post_process_content`
2020-10-21 19:55:22 +03:00
Kenneth Gitere
aacb442b7a Move MetaAttr to moz_readability and rename to MetaData
Add get_article_metadata, get_article_title and unescape_html_entities
and their tests
2020-10-20 22:27:40 +03:00
Kenneth Gitere
d99b1c687b Fix counting of h2 nodes in prep_article
Add test for prep_article
2020-10-20 10:13:34 +03:00
Kenneth Gitere
94fa8db218 Fix bug in deletion of multiple nodes.
When calling `detach` in a for loop or `for_each` iterator consumer,
only the first node is ever deleted.

Fix replacement of table nodes in prep_article
Edit clean_conditionally to remove unnecessary assignment.
2020-10-20 10:04:12 +03:00
Kenneth Gitere
ccdbbb5a16 Add initial implementation of grabArticle
Change function signature of setNodeTag to return a NodeRef

Minor fix in clean, clean_headers and clean_conditionally
2020-10-20 07:42:32 +03:00
Kenneth Gitere
3254064c0d Fix calls to select to return an iterator excluding the original
calling node.

Edit `next_element` to return either an element node only or an
element/text node
2020-10-17 07:13:39 +03:00
Kenneth Gitere
6377c01fb3 Add tests for clean_conditionally and fix_lazy_images
Minor refactor in `fix_lazy_images`
Fix incorrect boolean expression and bug in element node name comparison
in `clean_conditionally`
2020-10-16 08:03:01 +03:00
Kenneth Gitere
78d6e16618 Add unit tests for clean, clean_styles, clean_headers and
`clean_matched_nodes`

Add missing function calls in `prep_article`
2020-10-16 08:00:47 +03:00
Kenneth Gitere
b661211f0f Refactored code to use regexes from regexes module
Extracted constants from the code for easier reusability in some cases.
Change select queries for multiple elements to use the `,` operator
instead of calling `chain`.

Remove check for "null" in `fix_lazy_images`. The check mitigates a JSDOM
issue, so removing it doesn't affect the Rust code in any way.
2020-10-15 22:45:18 +03:00
Kenneth Gitere
75018894ae Add regexes module in moz_readability that contains the regular
expressions used. For optimal performance, the regular expressions
are compiled to static values to prevent recompiling in loops
2020-10-15 22:25:10 +03:00
Kenneth Gitere
d2bd31dc47 Add helper functions for the grabArticle function
2020-10-07 20:46:08 +03:00
Kenneth Gitere
7219198524 Change function signature of next_element to return an Option
rather than mutate a given value.

The new function signature reads a little easier than before.
Remove TODO task in replace_brs
2020-09-23 22:52:07 +03:00
Kenneth Gitere
7fb09130e8 Add calls to remove_scripts and prep_document
2020-08-31 20:40:37 +03:00
Kenneth Gitere
e1debf5630 Add moz_readability initial code and accompanying unit tests
This currently contains the preprocessing code of Readability.
It is a port of Readability.js by Mozilla.
2020-08-31 19:30:09 +03:00
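
Commit f17c9bfbc9 mentions fixing overflows in subtraction. The log does not show the affected code, so the following is only a minimal sketch of the general technique, assuming the values involved are unsigned (`usize`) counts. The function names `remaining_after` and `checked_remaining` are hypothetical and are not taken from the repository.

```rust
/// Hypothetical helper: items left after removing `removed` from `total`.
/// A plain `total - removed` on `usize` panics in debug builds (and wraps
/// around in release builds) when `removed > total`. `saturating_sub`
/// clamps the result at zero instead.
fn remaining_after(total: usize, removed: usize) -> usize {
    total.saturating_sub(removed)
}

/// Hypothetical alternative: report the underflow to the caller as `None`
/// rather than silently clamping it.
fn checked_remaining(total: usize, removed: usize) -> Option<usize> {
    total.checked_sub(removed)
}

fn main() {
    println!("{}", remaining_after(5, 3)); // prints 2
    println!("{}", remaining_after(3, 5)); // prints 0
    println!("{:?}", checked_remaining(3, 5)); // prints None
}
```

Clamping suits cases where zero is a sensible floor, while `checked_sub` is the safer choice when an underflow signals a logic error that the caller should handle.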
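
Commit 7219198524 changes `next_element` to return an `Option` rather than mutate a value passed in. The actual function operates on DOM nodes that are not shown in this log, so the sketch below uses a hypothetical `Node` enum and a slice of siblings purely to illustrate the shape of such a signature.

```rust
/// Hypothetical stand-in for a DOM node; the real code works on a DOM tree.
#[derive(Debug, PartialEq)]
enum Node {
    Element(&'static str),
    Text(&'static str),
}

/// Returns the first element node at or after index `start`, or `None`
/// when no element remains. The caller receives the result directly
/// instead of passing in a variable for the function to overwrite.
fn next_element(siblings: &[Node], start: usize) -> Option<&Node> {
    siblings
        .iter()
        .skip(start)
        .find(|node| matches!(node, Node::Element(_)))
}

fn main() {
    let siblings = [Node::Text(" "), Node::Element("p"), Node::Text("tail")];

    // The whitespace text node at index 0 is skipped.
    if let Some(node) = next_element(&siblings, 0) {
        println!("{:?}", node); // prints Element("p")
    }

    // Only a text node remains from index 2, so there is no next element.
    println!("{:?}", next_element(&siblings, 2)); // prints None
}
```

Returning `Option` makes the "nothing found" case explicit in the type, which fits the commit's remark that the new signature reads a little easier.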