Also add -i which lets wget read URLs from a file. In particular wget -i - which...

quyse · on Sept 4, 2023

`--config filename` allows this. `--config -` for stdin. Not only urls, but any config options

  echo '--url https://google.com/' | curl --config -

dspillett · on Sept 4, 2023

xargs doesn't have to wait, you can specify the number of items to include in a single sub-command and it'll batch things as they come in. For instance:

    ds@swann3:~# (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs -L5 echo
    1
    2
    3
    4
    5
    1 2 3 4 5
    6
    7
    8
    9
    10
    6 7 8 9 10
    11
    12
    [... and so on ...]

If the xargs call uses -I then --max-lines=1 is implied anyway.

If you replace echo with something that sleeps you'll see that the pipe doesn't stall waiting on xargs so the process producing the list can keep pushing new items to it as they are found:

    ds@swann3:~# (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs --max-lines=5 ./echosleepecho
    1
    2
    3
    4
    5
    starting 1 2 3 4 5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    done 1 2 3 4 5
    starting 6 7 8 9 10
    15
    16
    17
    18
    19
    [... and so on until ...]
    98
    99
    100
    done 46 47 48 49 50
    sleeping for 51 52 53 54 55
    done 51 52 53 54 55
    sleeping for 56 57 58 59 60
    [... and so on until xarg's stdin is exhausted]

And you can stop the calls made by xargs being sequential too for more parallelism with the --max-procs option (or use parallel instead of xargs):

    ds@swann3:~# (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs --max-lines=3 --max-procs=10 ./echosleepecho
    1
    2
    3
    sleeping for 1 2 3
    4
    5
    6
    sleeping for 4 5 6
    7
    8
    9
    sleeping for 7 8 9
    10
    11
    12
    sleeping for 10 11 12
    done 1 2 3
    13
    14
    15
    sleeping for 13 14 15
    done 4 5 6
    16
    [... and so on ...]

(I adjusted max-lines in that last example because my current timings made things line up in a manner that made the effect less obvious, adjusting the timings would have been equally valid, in a less artificial example like calling curl to get many resources timings will of course be less regular, perhaps these examples can be improved by randomising the sleeps)

I'm not sure what you would do about error handling in all this though, more experimentation necessary there before I'd ever do this in production!

dspillett · on Sept 5, 2023

Reply to self to add a note of something that coincidentally came up elsewhere¹ and is relevant to the above: of course xargs being able to push existing things forward while the list of actions is still being produced relies on it getting a steam of the list instead of the whole thing in one block. If your earlier stages cause a pipeline stall it can't help you.

For an artificial example, change

    (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs -L5 echo

to

    (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | sort | xargs -L5 echo

The sort command will, by necessity, absorb the list as it is produced and spit it all out at once at the end. xargs can still use multiple processes (if max-procs is used) to make use of concurrency to speed the work, but can't get started until the full list is produced and sorted.

----

[1] An unnecessary sort in an ETL process causing the overall wall-clock time to increase significantly

twic · on Sept 4, 2023

Right, but then you are invoking curl several times, and so not reusing a single connection, as you would with wget -i, so it still loses.

dspillett · on Sept 4, 2023

Valid criticism, ish, but that wasn't in what was previously asked for so well done on being like my day-job clients and failing to specify the problem completely :)

You can specify multiple URLs on the same command in curl so using xargs in this way would do what you ask to an extent (the connection would be dropped and renegotiated at least between each batch) as long as you don't use any options that imply --max-lines=1.

With the --max-procs option you could be requesting multiple files at once which may improve performance over wget -i – though obviously take care doing this against a single site (with wget -i too for that matter) as this can be rather unfriendly (if requesting from multiple hosts this is moot, as is the multiple files-from-one-connection point).