
Question about wget, subfolders, and index.html.

Let's say I am interested in the "travels/" folder on "website.com", i.e. "website.com/travels/".

Folder "travels/" contains a lot of files and other (sub)folders: "website.com/travels/list.doc" , "website.com/travels/cover.png" , "website.com/travels/[1990] America/" , "website.com/travels/[1994] Japan/", and so on...

How can I download only the ".mov" and ".jpg" files that reside in the subfolders? I don't want to pick up files sitting directly in "travels/" (e.g. not "website.com/travels/list.doc").

I found a wget command (on Unix & Linux Stack Exchange; I don't remember which discussion it was) that ended up downloading only each subfolder's "index.html" and nothing else. Why download only the index files?

  • Hi @T. Caio, would you please correct your link? It doesn't seem to be the right one! Commented Sep 21, 2018 at 13:28
  • Hi @Goro, which link should I correct? Sorry, I'm not a native English speaker and I'm quite new to Linux. Commented Sep 21, 2018 at 13:34
  • In the question you said "Here on https://unix.stackexchange.com ..." but there is no question about wget at that link! You probably copy/pasted the unix website link. Commented Sep 21, 2018 at 13:34
  • So you would like to know how to download (only) images and videos from a website's subfolders, is that correct? Commented Sep 21, 2018 at 13:36
  • @Goro Correct! There is more than one subfolder. Commented Sep 21, 2018 at 13:39

1 Answer


This command will download only images and movies from a given website:

wget -nd -r -P /save/location -A jpeg,jpg,bmp,gif,png,mov "http://www.somedomain.com"

According to the wget man page:

-nd prevents the creation of a directory hierarchy (i.e. no directories).

-r enables recursive retrieval. See Recursive Download for more information.

-P sets the directory prefix where all files and directories are saved to.

-A sets a whitelist for retrieving only certain file types. Strings and patterns are accepted, and both can be used in a comma separated list (as seen above). See Types of Files for more information.
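
Putting those flags together for the layout described in the question (website.com and travels/ are the question's placeholder names, and /save/location is kept from the answer), a run might look like the following; because of -nd, every matched file lands flat in /save/location instead of in a mirrored directory tree:

wget -nd -r -P /save/location -A jpg,mov "https://website.com/travels/"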

If you would like to download the subfolders as well, you need to use the --no-parent flag, with something similar to this command:

wget -r -l1 --no-parent -P /save/location -A jpeg,jpg,bmp,gif,png,mov "http://www.somedomain.com"

-r: recursive retrieval
-l1: sets the maximum recursion depth to 1
--no-parent: never ascends to the parent directory; downloads only from the specified subdirectory and below
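
For example, to pull media from a single subfolder without climbing back up into travels/ (the "[1990] America" name comes from the question; percent-encoding the space and brackets is an assumption about how the server exposes the path):

wget -r --no-parent -P /save/location -A jpg,mov "https://website.com/travels/%5B1990%5D%20America/"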

Regarding the index.html page: it is excluded once the -A flag is included in the wget command, because -A restricts downloads to the listed file types. Since html is not in the accept list, wget still fetches the HTML pages in order to follow their links during recursion, but deletes them afterwards and prints a message like the following in the terminal:

Removing /save/location/default.htm since it should be rejected.
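
As an aside, when it is easier to name what to exclude rather than what to accept, wget also has the complementary -R/--reject list; the index pages fetched for link traversal are deleted in the same way (a sketch, reusing the question's placeholder URL):

wget -r --no-parent -R "index.html*" -P /save/location "https://website.com/travels/"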

wget can download specific types of files, e.g. jpg, jpeg, png, mov, avi, mpeg, etc., when those files exist at the URL provided to wget. For example:

Let's say we would like to download the .zip and .chd files from this website (the archive.org link used in the command below).

At that link there are folders and .zip files (scroll to the end). Now, let's say we run this command:

wget -r --no-parent -P /save/location -A chd,zip "https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/"

This command will download the .zip files, but at the same time it will create empty folders for the .chd files.

In order to download the .chd files, we would need to extract the names of the empty folders and convert those folder names into their actual URLs, then put all the URLs of interest in a text file, file.txt.
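
One way to build file.txt is sketched below (an illustration, not part of the original answer; it assumes GNU find, that the empty directories from the first run sit under /save/location, and that the folder names only need their spaces percent-encoded):

base="https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms"
# List the names of the directories that ended up empty after the first run...
find /save/location -type d -empty -printf '%f\n' |
# ...and turn each name into a URL, percent-encoding the spaces.
while IFS= read -r name; do
    printf '%s/%s/\n' "$base" "${name// /%20}"
done > file.txt

Finally, feed this text file to wget: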

wget -r --no-parent -P /save/location -A chd,zip -i file.txt

The previous command will fetch all the .chd files.
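
As a quick sanity check (reusing the hypothetical /save/location from above), you can count how many .chd files actually arrived:

find /save/location -type f -name '*.chd' | wc -l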

  • wget -r -l2 --no-parent -P /my/local/path/ -A jpg https://website.com/remotefolder/ is NOT working (for my needs). wget "entered" all the subfolders, but from each one it only downloaded the respective "index.html" file (then removed it, because it was rejected). It didn't even try to download any further contents! Commented Sep 21, 2018 at 17:52
  • @Guru With your last try, wget keeps entering all the subfolders, but the only thing it does is (try to) download "index.html" (rejected), and nothing more. It seems like wget is blind to what's inside each subfolder... Commented Sep 21, 2018 at 20:12
  • @Guru I tried omitting the -A option: only "index.html" is downloaded, no other files (and no sub-subfolders). GNU Wget 1.19.5 on Arch Linux x86_64 Commented Sep 21, 2018 at 20:19
  • wget -r --no-parent -P /local/path -A chd https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/ Commented Sep 21, 2018 at 20:25
  • 1
    Yes! Worked! I am left with empty "somename.zip" folders corresponding to the *.zip inside the parent. Then I also have all the subfolders that I need, each one with with its *.chd content! wget -r --no-parent -P /local/path/ -A chd https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/ Commented Sep 22, 2018 at 7:06
