Kenneth Gitere
ae1ddb9386
Add printing of table for failed article downloads
...
- Map errors in `fetch_html` to include the source url
- Change `article_link` to `article_source`
- Add `Into` conversion for `UTF8Error`
- Collect errors in `generate_epubs` for displaying in a table
2021-04-20 21:33:24 +03:00
Kenneth Gitere
60fb30e8a2
Add url field in Extractor struct
2021-04-20 21:06:54 +03:00
Kenneth Gitere
b217448601
Add printing of tables upon successful extraction
2021-04-20 14:02:56 +03:00
Kenneth Gitere
04a1eed4e2
Add progress indicators for the cli
2021-04-17 17:28:07 +03:00
Kenneth Gitere
217cd3e442
Minor refactor
...
Change cli to grab version from the Cargo manifest
Rename fetch_url to fetch_html
2021-04-17 12:37:53 +03:00
Kenneth Gitere
7e9dcfc2b7
Add custom error types and ignore failed image downloads
...
Using this custom error type, many instances of unwrap are replaced
with mapping to errors that are then logged in main.rs. This stops
paperoni from crashing while downloading articles when the errors
are possibly recoverable or should not affect other downloads.
As a result, failed image downloads are now ignored and the
original URLs are left intact.
2021-04-17 12:04:06 +03:00
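The approach described in 7e9dcfc2b7 (return an error instead of calling unwrap, log it, and keep the original image URL) might look roughly like the sketch below. All names here (`ErrorKind`, `DownloadError`, `download_image`, `resolve_image_src`) are hypothetical and are not taken from paperoni's code; the "download" is a stand-in that performs no network access.

```rust
use std::fmt;

// Hypothetical error kinds; illustrative only, not paperoni's actual types.
#[derive(Debug)]
enum ErrorKind {
    InvalidUrl,
    Http(String),
}

#[derive(Debug)]
struct DownloadError {
    kind: ErrorKind,
}

impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.kind {
            ErrorKind::InvalidUrl => write!(f, "invalid image URL"),
            ErrorKind::Http(msg) => write!(f, "HTTP error: {}", msg),
        }
    }
}

impl std::error::Error for DownloadError {}

// Stand-in for a real download: fails for empty URLs and for any URL
// containing "broken"; otherwise pretends the image was saved locally.
fn download_image(url: &str) -> Result<String, DownloadError> {
    if url.is_empty() {
        return Err(DownloadError { kind: ErrorKind::InvalidUrl });
    }
    if url.contains("broken") {
        return Err(DownloadError {
            kind: ErrorKind::Http(format!("could not fetch {}", url)),
        });
    }
    let file_name = url.rsplit('/').next().unwrap_or("image");
    Ok(format!("images/{}", file_name))
}

// Returns the value to use as the image source: the local path on success,
// or the untouched original URL when the download fails.
fn resolve_image_src(url: &str) -> String {
    match download_image(url) {
        Ok(local_path) => local_path,
        Err(e) => {
            // Log to stderr and carry on instead of panicking.
            eprintln!("warning: {}", e);
            url.to_string()
        }
    }
}

fn main() {
    for url in ["https://example.com/a.png", "https://example.com/broken.png"] {
        println!("{} -> {}", url, resolve_image_src(url));
    }
}
```

The point of the sketch is that a single failed image affects only its own `src` value, so the rest of the article can still be exported.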
Kenneth Gitere
d6cbbe405b
Fix bug in inline_css_str_to_map
2021-04-14 18:07:39 +03:00
Kenneth Gitere
2762bc5086
Merge pull request #7 from hipstermojo/dev
...
Update README
2021-02-24 13:28:56 +03:00
Kenneth Gitere
b8c0cf29f1
Update README
2021-02-24 13:27:43 +03:00
Kenneth Gitere
e9f96d2970
Merge pull request #6 from hipstermojo/dev
...
Update to 0.3.0
2021-02-24 13:13:36 +03:00
Kenneth Gitere
165b2187be
Bump version
2021-02-24 13:03:52 +03:00
Kenneth Gitere
912bc9d915
Add flag for configuring maximum concurrent requests
...
Change printing macro for error messages to go out to stderr
2021-02-21 13:11:26 +03:00
Kenneth Gitere
b0c4c47413
Add support for merging articles into a single epub
...
This is still experimental as it lacks validation of the target file name
2021-02-11 13:51:21 +03:00
Kenneth Gitere
f0a610c2ac
Bug fix with empty titles
...
The code for title retrieval previously assumed that meta tags concerned
with the title would always contain a value, but some sites leave the value
empty, so an empty value now has to be checked for as well.
2021-02-09 12:56:07 +03:00
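The check described in f0a610c2ac can be illustrated with a small sketch that treats an empty or whitespace-only meta value the same as a missing one. The function name `first_non_empty_title` and the idea of passing candidates as a slice are assumptions made for illustration; the actual title-retrieval code may be structured differently.

```rust
// Picks the first candidate that is present and non-empty after trimming.
// Each candidate stands for the value of one title-related meta tag,
// with `None` meaning the tag was absent.
fn first_non_empty_title<'a>(candidates: &[Option<&'a str>]) -> Option<&'a str> {
    candidates
        .iter()
        .filter_map(|c| *c)
        .map(str::trim)
        .find(|t| !t.is_empty())
}

fn main() {
    // Hypothetical meta tag values: empty, missing, blank, then a usable one.
    let candidates = [Some(""), None, Some("  "), Some("A real title")];
    match first_non_empty_title(&candidates) {
        Some(title) => println!("title: {}", title),
        None => println!("no usable title found"),
    }
}
```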
Kenneth Gitere
65fdd967c1
Refactor image downloading and update README
...
Image downloading now uses streams instead of spawned tasks to ensure that
it does not start an unbounded number of spawned tasks
2021-02-09 10:34:35 +03:00
Kenneth Gitere
003953332f
Refactor downloading of HTML pages
...
This change allows for parallel downloads of HTML pages up to a maximum
number of concurrent HTTP requests, which is more efficient than
before, when all HTTP requests were likely to begin at the same time.
2021-02-06 17:06:03 +03:00
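The idea in 003953332f of capping the number of in-flight requests can be sketched with the standard library alone. This is not necessarily how the commit implements it: the sketch swaps in OS threads processed in fixed-size batches, which is a simpler but coarser technique than an async approach. The `fetch_html` here is a stand-in that performs no network access, and `fetch_all` and `max_conn` are hypothetical names.

```rust
use std::thread;

// Stand-in for an HTTP request; returns fake HTML for the given URL.
fn fetch_html(url: &str) -> String {
    format!("<html>{}</html>", url)
}

// Fetches all URLs with at most `max_conn` requests running at once.
// URLs are processed in batches; each batch is joined before the next starts,
// so results come back in the same order as the input.
fn fetch_all(urls: &[String], max_conn: usize) -> Vec<String> {
    let mut pages = Vec::with_capacity(urls.len());
    // `chunks` panics on a size of 0, so fall back to one request at a time.
    for batch in urls.chunks(max_conn.max(1)) {
        let handles: Vec<_> = batch
            .iter()
            .cloned()
            .map(|url| thread::spawn(move || fetch_html(&url)))
            .collect();
        for handle in handles {
            pages.push(handle.join().expect("worker thread panicked"));
        }
    }
    pages
}

fn main() {
    let urls: Vec<String> = (1..=5)
        .map(|i| format!("https://example.com/{}", i))
        .collect();
    for page in fetch_all(&urls, 2) {
        println!("{}", page);
    }
}
```

Batching leaves slots idle while the slowest request in a batch finishes; a stream-based or semaphore-based limit avoids that at the cost of more machinery.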
Kenneth Gitere
6b62051942
Add replace_metadata_value function
2021-02-06 13:53:04 +03:00
Kenneth Gitere
b402472ba6
Add http and epub modules
2021-02-06 12:59:03 +03:00
Kenneth Gitere
08f847531f
Remove empty lines when reading from an input file
2021-02-03 07:39:51 +03:00
Kenneth Gitere
3d56023592
Add -f flag for adding links from a file instead of needing to use cat
2021-02-01 11:31:24 +03:00
Kenneth Gitere
c82071a871
Merge pull request #5 from hipstermojo/dev
...
Merge 0.2.2-alpha-1
2021-01-24 18:00:50 +03:00
Kenneth Gitere
b98c0a69a6
Bump version
2021-01-24 17:54:33 +03:00
Kenneth Gitere
21c3ffd922
Refactor fetch_url
...
This adds:
- More validation of responses to ensure the HTML response is valid.
- Better handling of redirecting URLs which allows for fetching of
links proxied to Medium.
2021-01-24 17:52:31 +03:00
Kenneth Gitere
1dc7b3432b
Bug fixes
...
The bug fixes include:
- `<html>` nodes being added to the replaced image when `unwrap_noscript_tags`
is called.
- Remove the `srcset` attribute of `<img>` tags after downloading the image. Leaving
it in place prevented readers like Foliate from displaying the downloaded image
2021-01-12 10:27:46 +03:00
Kenneth Gitere
ca1f9e2800
Merge pull request #4 from hipstermojo/dev
...
Update to 0.2.1-alpha1
2020-12-24 14:11:42 +03:00
Kenneth Gitere
8407c613df
Bug fixes
...
- Prevent downloading images with base64 strings as the source
- Add escaping of quotation characters in the serializer
- Disable redirects when downloading images which fails on multiple sites
- Remove invalid characters for making the epub export file name
- Fix version number in release
2020-12-24 14:03:36 +03:00
Kenneth Gitere
3c7dc9a416
Merge pull request #3 from hipstermojo/dev
...
0.2.0 update
2020-11-24 18:42:29 +03:00
Kenneth Gitere
3bfa82ba60
Update README and version
2020-11-24 18:39:51 +03:00
Kenneth Gitere
725c73c83f
Add basic redirect provided by surf and early exit of the program if the response is not a 200
2020-11-24 18:31:16 +03:00
Kenneth Gitere
5f99bddc10
Add custom serializer for XHTML
2020-11-24 14:54:23 +03:00
Kenneth Gitere
37cb4e1fd2
Change from structopt to clap
...
This allows printing the help message if no args are passed
2020-11-24 09:58:50 +03:00
Kenneth Gitere
cdfbc2b3f6
Refactor inline_css_str_to_map to use a better tokenizer
2020-11-24 08:29:00 +03:00
Kenneth Gitere
aff4054ca9
Update crates and fix bugs
...
The bug fixes are for:
- <base> elements with "/" as the href
- articles containing an ampersand in the title which would create
corrupted manifest files.
2020-11-23 15:55:58 +03:00
Kenneth Gitere
ef3efdba81
Refactor to use temp directory and update surf
...
Change from using res directory for image downloads to using temp directories.
Update surf to v2, which required changing the way Content-Type headers are
read.
2020-11-23 13:38:58 +03:00
Kenneth Gitere
ab800d0174
Bug fix and add printing of the name of the extracted EPUB
...
The fix prevents creating the res directory if it already exists
2020-11-23 09:06:13 +03:00
Kenneth Gitere
b0e402d685
Resize logo
2020-10-24 08:20:47 +03:00
Kenneth Gitere
fbf2f0b3d8
Merge pull request #2 from hipstermojo/dev
...
Merge v0.1.0
2020-10-22 19:25:19 +03:00
Kenneth Gitere
566c3427be
Merge pull request #1 from hipstermojo/readability
...
Add Readability port
2020-10-22 19:24:31 +03:00
Kenneth Gitere
be48cc1e47
Fix alignment in README
...
Update manifest file
Add fix in serialized file to have self closing tags which is invalid
xhtml
2020-10-22 19:18:18 +03:00
Kenneth Gitere
6aef1631e3
Add README
2020-10-22 16:03:57 +03:00
Kenneth Gitere
1b4c4ee658
Change CLI option to allow for multiple arguments
...
Add basic looping in async runtime
2020-10-22 15:22:56 +03:00
Kenneth Gitere
db11e78d8c
Add template for epub output
...
Change output format to name file with the title name
Add getters in MetaData
2020-10-22 13:55:02 +03:00
Kenneth Gitere
703de7e3bf
Merge the readability module with the rest of the extractor
2020-10-22 12:12:30 +03:00
Kenneth Gitere
679bf3cb04
Add logic for attempting different rounds for content extraction
...
with different flags set
Add additional test in `fix_relative_uris`
2020-10-22 11:50:34 +03:00
Kenneth Gitere
a0f69ccf80
Fix bug in is_probably_visible
...
Add fix in `grab_article` when appending nodes. This internally
detaches children so it can end up running only once
2020-10-22 11:37:02 +03:00
Kenneth Gitere
a94798cc95
Add flags for conditional cleaning and removal of nodes
...
This also includes updating the function signatures of the affected
methods
2020-10-22 08:24:46 +03:00
Kenneth Gitere
f17c9bfbc9
Add bug fixes for overflows in subtraction, giving a default for
...
capture groups and in extracting nodes. Add fix in `is_probably_visible`
2020-10-21 20:48:21 +03:00
Kenneth Gitere
350447d1c4
Change calls on replacing regexes to replace_all
...
Add `fix_relative_uris`, `clean_classes`, `clean_readability_attrs`
and `post_process_content`
2020-10-21 19:55:22 +03:00
Kenneth Gitere
aacb442b7a
Move MetaAttr to moz_readability and rename to MetaData
...
Add get_article_metadata, get_article_title and unescape_html_entities
and their tests
2020-10-20 22:27:40 +03:00
Kenneth Gitere
d99b1c687b
Fix counting of h2 nodes in prep_article
...
Add test for prep_article
2020-10-20 10:13:34 +03:00