-
So glad I found this great script, thanks very much. I have a website where the subpages (the content I'm after) are all links on the front page. I'm using PowerShell on Windows, and I'm just stuck at the last bit (no xargs in PowerShell). I can use Python (BeautifulSoup) to scrape my pages and build valid URLs from the relative links. They are in a text file, say list.txt, and I want to merge them into a single epub:
But I can't work out how to pass this file, or its contents, to percollate. I've tried:
$list = (get-content -path list.txt)
percollate epub --output="test01.epub" $list
Also, it doesn't 'parse' them as such, and appears to be treating all 3 as one big long single URL, which looks OK apart from the spaces. Any hints? Thanks!
-
Hi @dwids, I don’t have experience with PowerShell, so I can only be of limited help. This page suggests two approaches:
If there’s a way to confirm that either of these produces a valid list of command-line arguments, it means we then need to investigate at the Node.js / percollate level.
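Since Python is already in the picture for scraping, one way to sidestep shell parsing altogether might be to let Python launch percollate with each URL as its own argument. This is only a minimal sketch under a few assumptions: list.txt holds one URL per line, percollate is on the PATH, and the helper names (`read_urls`, `build_command`, `run_percollate`) are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def read_urls(path):
    """Return the non-empty lines of a URL list file, stripped of whitespace."""
    # utf-8-sig also drops the BOM that some Windows tools write.
    text = Path(path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_command(urls, output, fmt="epub", exe="percollate"):
    """Build an argument list in which every URL is a separate element."""
    return [exe, fmt, f"--output={output}", *urls]


def run_percollate(list_path, output, fmt="epub"):
    """Read the URL list and run percollate once over all of the URLs."""
    # shutil.which honours PATHEXT, so it should find the npm shim on Windows.
    exe = shutil.which("percollate")
    if exe is None:
        raise FileNotFoundError("percollate not found on PATH")
    cmd = build_command(read_urls(list_path), output, fmt=fmt, exe=exe)
    subprocess.run(cmd, check=True)
```

Calling `run_percollate("list.txt", "test01.epub")` would then be the equivalent of typing all the URLs out by hand. Passing a list to `subprocess.run` means no shell gets a chance to join or re-split the arguments.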
-
G'day @danburzo. Thanks very much. Firstly, I'm going to re-try using Linux-under-Windows (WSL2) later today, so I'll focus on that. Lower priority: as for Windows per se, I had an idea that it could be the way PowerShell passes (even parses) things, like its $variables, when talking to Node.js / percollate. Two simple tests seem to imply an issue (?)
# Base case; go 'manual'; enter the 4 URLs individually
percollate pdf --output="Blog Test 01.pdf" https://sidwell.id.au/2020/05/11/hanging-rock-and-me/ https://sidwell.id.au/2020/06/04/first-australian-stories-myths-or-not/ https://sidwell.id.au/2020/05/30/its-all-relative-mummy-c-2004/ https://sidwell.id.au/2020/05/29/computer-things-going-wrong-01/
That worked: a valid PDF with 4 'chapters' (including a valid TOC). Now just load the 4 URLs into a PowerShell variable ($urls) and drop that into the command:
$urls = ' https://sidwell.id.au/2020/05/11/hanging-rock-and-me/ https://sidwell.id.au/2020/06/04/first-australian-stories-myths-or-not/ https://sidwell.id.au/2020/05/30/its-all-relative-mummy-c-2004/ https://sidwell.id.au/2020/05/29/computer-things-going-wrong-01/'
percollate pdf --output="Blog Test 02.pdf" $urls
And it seems to see the 4 URLs as 'one':
Blog Test 02.pdf only had the content from the final URL (computer-things-going-wrong-01) but listed the source as being from the 4 of them:
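That result would be consistent with $urls arriving at percollate as one argument: a quoted string is a single value, spaces and all, so the program never sees four separate URLs. A small Python sketch of the difference, using shortened placeholder URLs and a made-up helper name (`split_urls`):

```python
def split_urls(blob):
    """Split a whitespace-separated blob of URLs into a list of URLs."""
    return blob.split()


# One string containing several URLs, like the $urls variable in the test.
urls = " https://example.com/a/ https://example.com/b/ https://example.com/c/"

# Passing the string as-is gives the program a single URL argument.
as_one = ["percollate", "pdf", "--output=test.pdf", urls]

# Splitting first gives the program one argument per URL.
as_many = ["percollate", "pdf", "--output=test.pdf", *split_urls(urls)]

print(len(as_one) - 3)   # 1 URL argument
print(len(as_many) - 3)  # 3 URL arguments
```

If that is what is happening, splitting the string into an array before handing it to the command should be the thing to try on the PowerShell side as well.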
-
Thanks! I re-tried that exact command on Windows 11
And it repeated the above: it only added the final URL's contents and listed the '4 in 1' Source inside the epub. But I've got better news on WSL2. Will document here next.
-
tl;dr :-) Percollate under Windows Subsystem for Linux 2 (WSL2) worked first time. I used the standard Windows Terminal. WSL2 is running Ubuntu 22.04.4 LTS.
Summary:
Jumped to my Windows D: drive folder (from the earlier test) with my URL list text files. It worked first time.
The epub has the correct 4 'chapters'; one per blog page. Very happy! Thanks again.