paperoni/src/main.rs

#[macro_use]
extern crate lazy_static;

use async_std::stream;
use async_std::task;
use futures::stream::StreamExt;
use url::Url;

mod cli;
mod epub;
mod errors;
mod extractor;
/// This module is responsible for async HTTP calls for downloading
/// the HTML content and images
mod http;
mod moz_readability;

use cli::AppConfig;
use epub::generate_epubs;
use extractor::Extractor;
use http::{download_images, fetch_html};

fn main() {
    let app_config = cli::cli_init();

    if !app_config.urls().is_empty() {
        download(app_config);
    }
}

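/// Fetches every URL in `app_config`, extracts the article content and its
/// images, then writes the results out as EPUB files.
///
/// Failures are logged to stderr and do not stop the remaining downloads.
fn download(app_config: AppConfig) {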
    let articles = task::block_on(async {
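        // The fetch futures are created lazily; `buffered` polls at most
        // `max_conn` of them at a time and yields results in URL order.
        let urls_iter = app_config.urls().iter().map(|url| fetch_html(url));
        let mut responses = stream::from_iter(urls_iter).buffered(app_config.max_conn());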
        let mut articles = Vec::new();
        while let Some(fetch_result) = responses.next().await {
            match fetch_result {
                Ok((url, html)) => {
                    println!("Extracting {}", url);
                    let mut extractor = Extractor::from_html(&html);
                    extractor.extract_content(&url);

                    if extractor.article().is_some() {
                        extractor.extract_img_urls();

                        if let Err(img_errors) =
                            download_images(&mut extractor, &Url::parse(&url).unwrap()).await
                        {
                            eprintln!(
                                "{} image{} failed to download for {}",
                                img_errors.len(),
                                if img_errors.len() > 1 { "s" } else { "" },
                                url
                            );
                        }
                        articles.push(extractor);
                    }
                }
                Err(e) => eprintln!("{}", e),
            }
        }
        articles
    });
    if let Err(e) = generate_epubs(articles, app_config.merged()) {
        eprintln!("{}", e);
    }
}