Running

curl -s 'https://www.idealista.com/inmueble/94238881/' \
  -H 'authority: www.idealista.com' \
  -H 'cache-control: max-age=0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-gpc: 1' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'referer: https://www.idealista.com/usuario/favoritos/' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H $'cookie: didomi_token=eyJ1c2VyX2lkIjoiMTc5MTljZjItMzUxNi02NmRjLTk0YjYtNTM3ODFiMjY1NGU5IiwiY3JlYXRlZCI6IjIwMjEtMDQtMjhUMTg6NDg6MzIuMDk5WiIsInVwZGF0ZWQiOiIyMDIxLTA0LTI4VDE4OjQ4OjMyLjA5OVoiLCJ2ZW5kb3JzIjp7ImRpc2FibGVkIjpbInR3aXR0ZXIiLCJnb29nbGUiLCJmYWNlYm9vayIsImM6bWl4cGFuZWwiLCJjOmlkZWFsaXN0YS1mZVJFamUyYyIsImM6aWRlYWxpc3RhLUx6dEJlcUUzIiwiYzphYnRhc3R5LUxMa0VDQ2o4IiwiYzpob3RqYXIiLCJjOnlhbmRleG1ldHJpY3MiLCJjOmJlYW1lci1IN3RyN0hpeCIsImM6dGVhbGl1bWNvLURWRENkOFpQIiwiYzpjaGFyYmVhdC1aNFFrOENhaCJdfSwicHVycG9zZXMiOnsiZGlzYWJsZWQiOlsiZ2VvbG9jYXRpb25fZGF0YSIsImFuYWxpdGljYXMtZHlGVkdSZTgiXX0sInZlcnNpb24iOjIsImFjIjoiQUFBQS5BQUFBIn0=; euconsent-v2=CPFYMwBPFYMwBAHABBENBXCgAAAAAAAAAAAAAAAAAAEBoFAAVgAuACGAGQAMsAagA2QB2AD8AIAAQUAjABSwCngFXgLQAtIBrADeAHVAPkAhsBDoCKgEXgJEATYAnYBSIC5AGBAMJAYeAxgBk4DOQGeAM-AckA5QB1hKB6AAgABYAFAAMgAcABFADAAMQAeABEACYAFUALgAXwAxABmADaAIQAQ0AiACJAEcAKMAUoAtwBhADKAGqANkAd4A_ACMAEcAKeAVeAtAC0gF1AMUAbgA4gB1AD5AIdARUAi8BIgCbAFigLYAXaAvMBh4DIgGTgMsAZyAzwBnwDSAGsAOAAdYA7UpBQAAXABQAFQAMgAcgA-AEAAIoAYABjADQANQAeQBDAEUAJgATwApABVACwAFwAL4AYgAzABzAEIAIaARABEgCjAFKALEAW4AwgBlADRAGqANkAd8A-wD9AIsARgAjgBKQCggFDAKuAVsAuYBeQDFAG0ANwAegBDoCLwEiAJsATsAocBTQCtgFigLYAXAAuQBdoC8wGGgMPAYwAyIBkgDJwGXAM5AZ4Az6BpAGkwNYA1kBscDkwOUAcuA6wB2oDxyEEIABYAFAAMgAiABcADEAIYATAAqgBcAC-AGIAMwAbwA9ACOAFiAMIAZQA1ABvgDvgH2AfgA_wCMAEcAJSAUEAoYBTwCrwFoAWkAuYBfgDFAG0AOoAegBIICRAEqAJsAU0AsUBaMC2ALaAXAAuQBdoDDwGJAMiAZOAzkBngDPgGiANJAaWA1UBwADkgHRgOsAdqA8cdBxAAXABQAFQAMgAcgA-AEAAIgAXQAwADGAGgAagA8AB9AEMARQAmABPgCqAKwAWIAuAC6AF8AMQAZgA3gBzAD0AIQAQ0AiACJAEdAJYAmABNACjAFKALEAW8AwgDDAGQAMoAaIA1ABsgDfAHeAPaAfYB-gD_AIHARYBGACOQEpASoAoIBTwCrgFigLQAtIBcwC6gF5AL8AYoA2gBuADiQHTAdQA9ACGwEOgIiARUAi8BIICRAEqAJsATsAocBTQCrAFigLQgWwBbIC4AFyALtAXeAvMBgwDCQGGgMPAYkAxgBjwDJAGTgMqAZYAy4BnIDPgGiQNIA0kBpYDTgGqgNYAbGA4uByQHKgOXAdGA6wB44D0gHqhILYACAAFwAUABUADIAHIAPABAACIAGEANAA1AB5AEMARQAmABPgCqAKwAWAAuABvADmAHoAQgAhoBEAESAI6ASwBLgCaAFKALcAYYAyABlwDUANUAbIA7wB7AD4gH2AfoBAACBwEXARgAjQBHACUgFBAKWAU8Aq4BcwC_AGKANYAbQA3ABvADiA
HoAPkAhsBDoCLwEiAJiATKAmwBOwChwFIgLFAWgAtgBcgC7wF5gMCAYMAwkBhoDDwGRAMkAZOAy4BnIDPgGkANOgawBrMDkQOVAcuA6MB1gDxxkBwACgAQwAmABcAEcAMsAagA7IB9gH4ARgAjgBSwCrgFbAN4AmIBNgC0QFsALzAYEAw8BkQDOQGeAM-AckA5QVAfAAoAEMAJgAXABHADLAGoAOwAfgBGACOAFLAKvAWgBaQDeAJBATEAmwBTYC2AFyALzAYEAw8BkQDOQGeAM-AbkA5IBygAA.YAAAAAAAAAAA; smc="{}"; userUUID=29668197-3ec8-4380-83a0-af74497d4652; askToSaveAlertPopUp=true; cookieSearch-1="/areas/venta-terrenos/con-precio-hasta_30000,precio-desde_5000,metros-cuadrados-mas-de_500,metros-cuadrados-menos-de_5000,terrenos-urbanos/?shape=%28%28uyxcF%7Cqxl%40%7BxEyrlCput%40ghjBfnw%40%7Dnr%40rlxA%7Dj_AjdWq%60xAr%60r%40_pUvdy%40rwg%40r%7CLr%7BZmfy%40f%7DsB_%7EwBb_aDyj%7DA%7CsoBupaA%7CaO%29%29:1631778650981"; uc="jNLcI107+7z1wRs0x4TdO3u5jcaE+Wrl/o5Drt4SE9qHCxSMOrDfJCS1OVr9tkKQ1xbkhwtCXOZOKw1BLkvMTAMsML+Z10HjHWdUIRhRkRXsEcNnPFEt9rqCk0DCCd7EBSE6A/jp5vs="; nl="wrtrmuF9QzNOYYO2P8SN3OqyjHQXevAY7aYvx0cKdUNwML7qYn47dSs63/pFStgOTH50K6V1y0hMkNG4T70na63g0fJdDSpgDegfruZFCA9GnVx058kgR638a8Q81Gz9r1nzfAqJdfs="; SESSION=2e11a953aead5c1f~d77af03d-2746-4657-89a6-ab25cb02e889; contactd77af03d-2746-4657-89a6-ab25cb02e889="{\'email\':\'EqU9vfX+DxXdzl+iReclTDED4UB/DY6iSfTeSLvLlsE=\',\'phone\':\'634821160\',\'phonePrefix\':null,\'friendEmails\':null,\'name\':\'RL9PoIsz5dW5rvLNF/0gvA==\',\'message\':null,\'message2Friends\':null,\'maxNumberContactsAllow\':10,\'defaultMessage\':true}"; sendd77af03d-2746-4657-89a6-ab25cb02e889="{\'friendsEmail\':null,\'email\':\'EqU9vfX+DxXdzl+iReclTDED4UB/DY6iSfTeSLvLlsE=\',\'message\':null}"; datadome=Gegr2x~z62hH2U.ay-n4sMd3xgi-RaB1X5XMr4i5qV2q6GYsINszxSNDS732-spxaAaUp.m7aGMcOgN-DcAxFY9KCQsldTsDl-RVS5ocEm; cc=eyJhbGciOiJIUzI1NiJ9.eyJjdCI6OTk2MjMyNiwiZXhwIjoxNjMxOTUxOTc4fQ.0YJZGR34jgb1SWTAzL9DFPhAeeP4k0hE9igRE8-SQp0' \
--compressed | grep -m 1 "<meta name=\"description\" content="

would give me:

<meta name="description" content="terreno de 936 m², Terreno en venta en paseo Blasco Ibáñez s/n, Costa Esuri, Ayamonte, Costa Esuri">

To extract just the text, without any HTML, I tried piping the output through sed and ended up with this command, which works as expected:

curl -s 'https://www.idealista.com/inmueble/94238881/' \
  -H 'authority: www.idealista.com' \
  -H 'cache-control: max-age=0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-gpc: 1' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'referer: https://www.idealista.com/usuario/favoritos/' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H $'cookie: didomi_token=eyJ1c2VyX2lkIjoiMTc5MTljZjItMzUxNi02NmRjLTk0YjYtNTM3ODFiMjY1NGU5IiwiY3JlYXRlZCI6IjIwMjEtMDQtMjhUMTg6NDg6MzIuMDk5WiIsInVwZGF0ZWQiOiIyMDIxLTA0LTI4VDE4OjQ4OjMyLjA5OVoiLCJ2ZW5kb3JzIjp7ImRpc2FibGVkIjpbInR3aXR0ZXIiLCJnb29nbGUiLCJmYWNlYm9vayIsImM6bWl4cGFuZWwiLCJjOmlkZWFsaXN0YS1mZVJFamUyYyIsImM6aWRlYWxpc3RhLUx6dEJlcUUzIiwiYzphYnRhc3R5LUxMa0VDQ2o4IiwiYzpob3RqYXIiLCJjOnlhbmRleG1ldHJpY3MiLCJjOmJlYW1lci1IN3RyN0hpeCIsImM6dGVhbGl1bWNvLURWRENkOFpQIiwiYzpjaGFyYmVhdC1aNFFrOENhaCJdfSwicHVycG9zZXMiOnsiZGlzYWJsZWQiOlsiZ2VvbG9jYXRpb25fZGF0YSIsImFuYWxpdGljYXMtZHlGVkdSZTgiXX0sInZlcnNpb24iOjIsImFjIjoiQUFBQS5BQUFBIn0=; euconsent-v2=CPFYMwBPFYMwBAHABBENBXCgAAAAAAAAAAAAAAAAAAEBoFAAVgAuACGAGQAMsAagA2QB2AD8AIAAQUAjABSwCngFXgLQAtIBrADeAHVAPkAhsBDoCKgEXgJEATYAnYBSIC5AGBAMJAYeAxgBk4DOQGeAM-AckA5QB1hKB6AAgABYAFAAMgAcABFADAAMQAeABEACYAFUALgAXwAxABmADaAIQAQ0AiACJAEcAKMAUoAtwBhADKAGqANkAd4A_ACMAEcAKeAVeAtAC0gF1AMUAbgA4gB1AD5AIdARUAi8BIgCbAFigLYAXaAvMBh4DIgGTgMsAZyAzwBnwDSAGsAOAAdYA7UpBQAAXABQAFQAMgAcgA-AEAAIoAYABjADQANQAeQBDAEUAJgATwApABVACwAFwAL4AYgAzABzAEIAIaARABEgCjAFKALEAW4AwgBlADRAGqANkAd8A-wD9AIsARgAjgBKQCggFDAKuAVsAuYBeQDFAG0ANwAegBDoCLwEiAJsATsAocBTQCtgFigLYAXAAuQBdoC8wGGgMPAYwAyIBkgDJwGXAM5AZ4Az6BpAGkwNYA1kBscDkwOUAcuA6wB2oDxyEEIABYAFAAMgAiABcADEAIYATAAqgBcAC-AGIAMwAbwA9ACOAFiAMIAZQA1ABvgDvgH2AfgA_wCMAEcAJSAUEAoYBTwCrwFoAWkAuYBfgDFAG0AOoAegBIICRAEqAJsAU0AsUBaMC2ALaAXAAuQBdoDDwGJAMiAZOAzkBngDPgGiANJAaWA1UBwADkgHRgOsAdqA8cdBxAAXABQAFQAMgAcgA-AEAAIgAXQAwADGAGgAagA8AB9AEMARQAmABPgCqAKwAWIAuAC6AF8AMQAZgA3gBzAD0AIQAQ0AiACJAEdAJYAmABNACjAFKALEAW8AwgDDAGQAMoAaIA1ABsgDfAHeAPaAfYB-gD_AIHARYBGACOQEpASoAoIBTwCrgFigLQAtIBcwC6gF5AL8AYoA2gBuADiQHTAdQA9ACGwEOgIiARUAi8BIICRAEqAJsATsAocBTQCrAFigLQgWwBbIC4AFyALtAXeAvMBgwDCQGGgMPAYkAxgBjwDJAGTgMqAZYAy4BnIDPgGiQNIA0kBpYDTgGqgNYAbGA4uByQHKgOXAdGA6wB44D0gHqhILYACAAFwAUABUADIAHIAPABAACIAGEANAA1AB5AEMARQAmABPgCqAKwAWAAuABvADmAHoAQgAhoBEAESAI6ASwBLgCaAFKALcAYYAyABlwDUANUAbIA7wB7AD4gH2AfoBAACBwEXARgAjQBHACUgFBAKWAU8Aq4BcwC_AGKANYAbQA3ABvADiA
HoAPkAhsBDoCLwEiAJiATKAmwBOwChwFIgLFAWgAtgBcgC7wF5gMCAYMAwkBhoDDwGRAMkAZOAy4BnIDPgGkANOgawBrMDkQOVAcuA6MB1gDxxkBwACgAQwAmABcAEcAMsAagA7IB9gH4ARgAjgBSwCrgFbAN4AmIBNgC0QFsALzAYEAw8BkQDOQGeAM-AckA5QVAfAAoAEMAJgAXABHADLAGoAOwAfgBGACOAFLAKvAWgBaQDeAJBATEAmwBTYC2AFyALzAYEAw8BkQDOQGeAM-AbkA5IBygAA.YAAAAAAAAAAA; smc="{}"; userUUID=29668197-3ec8-4380-83a0-af74497d4652; askToSaveAlertPopUp=true; cookieSearch-1="/areas/venta-terrenos/con-precio-hasta_30000,precio-desde_5000,metros-cuadrados-mas-de_500,metros-cuadrados-menos-de_5000,terrenos-urbanos/?shape=%28%28uyxcF%7Cqxl%40%7BxEyrlCput%40ghjBfnw%40%7Dnr%40rlxA%7Dj_AjdWq%60xAr%60r%40_pUvdy%40rwg%40r%7CLr%7BZmfy%40f%7DsB_%7EwBb_aDyj%7DA%7CsoBupaA%7CaO%29%29:1631778650981"; uc="jNLcI107+7z1wRs0x4TdO3u5jcaE+Wrl/o5Drt4SE9qHCxSMOrDfJCS1OVr9tkKQ1xbkhwtCXOZOKw1BLkvMTAMsML+Z10HjHWdUIRhRkRXsEcNnPFEt9rqCk0DCCd7EBSE6A/jp5vs="; nl="wrtrmuF9QzNOYYO2P8SN3OqyjHQXevAY7aYvx0cKdUNwML7qYn47dSs63/pFStgOTH50K6V1y0hMkNG4T70na63g0fJdDSpgDegfruZFCA9GnVx058kgR638a8Q81Gz9r1nzfAqJdfs="; SESSION=2e11a953aead5c1f~d77af03d-2746-4657-89a6-ab25cb02e889; contactd77af03d-2746-4657-89a6-ab25cb02e889="{\'email\':\'EqU9vfX+DxXdzl+iReclTDED4UB/DY6iSfTeSLvLlsE=\',\'phone\':\'634821160\',\'phonePrefix\':null,\'friendEmails\':null,\'name\':\'RL9PoIsz5dW5rvLNF/0gvA==\',\'message\':null,\'message2Friends\':null,\'maxNumberContactsAllow\':10,\'defaultMessage\':true}"; sendd77af03d-2746-4657-89a6-ab25cb02e889="{\'friendsEmail\':null,\'email\':\'EqU9vfX+DxXdzl+iReclTDED4UB/DY6iSfTeSLvLlsE=\',\'message\':null}"; datadome=Gegr2x~z62hH2U.ay-n4sMd3xgi-RaB1X5XMr4i5qV2q6GYsINszxSNDS732-spxaAaUp.m7aGMcOgN-DcAxFY9KCQsldTsDl-RVS5ocEm; cc=eyJhbGciOiJIUzI1NiJ9.eyJjdCI6OTk2MjMyNiwiZXhwIjoxNjMxOTUxOTc4fQ.0YJZGR34jgb1SWTAzL9DFPhAeeP4k0hE9igRE8-SQp0' \
--compressed | grep -m 1 "<meta name=\"description\" content=" | sed -E 's;^.*(content=\");;;s;\">$;;' 

My question is: why doesn't the single substitution sed -E 's;^.*(content=\")\">$;;' work? I expected it to give me this result:

terreno de 936 m², Terreno en venta en paseo Blasco Ibáñez s/n, Costa Esuri, Ayamonte, Costa Esuri 

whereas it gives me the untrimmed line:

<meta name="description" content="terreno de 936 m², Terreno en venta en paseo Blasco Ibáñez s/n, Costa Esuri, Ayamonte, Costa Esuri">

One may need to go to https://www.idealista.com/inmueble/94238881/, then open developer tools in their browser and copy as cURL in order to play with this example.

  • What is ^.*(content=\")\">$ supposed to match? afaics, that will only match something like ... stuff ... content="">. Only you know how you thought that should work ;-) Commented Sep 17, 2021 at 9:24
  • @UncleBilly In response to your comment I just added more information to the question. Later I will delete this comment in order not to clutter USE ;-) Commented Sep 17, 2021 at 9:48
  • Matching two consecutive double quotes ("") won't match anything. You need something like 's#^.*(content=\")([^"]+)\">$#\2#'. Commented Sep 17, 2021 at 9:54
  • Don't parse HTML with regular expressions. There are far too many ways that can fail. Use an HTML parser instead. If you want to write a web bot, I recommend using perl, with the LWP library or related modules like WWW::Mechanize or Web::Scraper, or python with libs like Beautiful Soup or Mechanize Commented Sep 18, 2021 at 4:51

1 Answer

There are two problems with your RegEx, which may stem from a misunderstanding of how RegExes work.

  1. You have used the "extended" regular expression syntax, which makes ( and ) special characters that delimit capture groups. Apart from that, they do not affect what the pattern matches. Since you don't make use of the capture group, your RegEx amounts to
    ^.*content=\"\">$
    
    which expects the pattern content="" with an empty quoted string, and immediately followed by the closing >. This doesn't occur in your input, so sed does nothing as no match is achieved. (By the way, you don't need to escape the " - they are not special in RegExes, and your program is in single-quotes, so the shell won't misinterpret the " either.)
  2. Even if you correct that, as in
    ^.*content="[^"]*".*>$
    
    you replace the matched part of the line, which due to the anchors is the entire line, with the empty string, so in your case, nothing would remain.
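Both failure modes can be reproduced without the curl call. The following is a minimal sketch that assumes a sed supporting -E (GNU or BSD); the variable names and the sample line are illustrative, and the sample is a shortened, ASCII-only stand-in for the real <meta> line:

```shell
#!/bin/sh
# Shortened, ASCII-only stand-in for the line that grep returns (assumption).
line='<meta name="description" content="Terreno en venta, Costa Esuri">'

# Problem 1: the pattern requires content="" directly before the final >,
# which the input does not contain, so sed prints the line unchanged.
no_match=$(printf '%s\n' "$line" | sed -E 's;^.*(content=\")\">$;;')

# Problem 2: the pattern now matches, but it matches the whole line and
# replaces it with the empty string, so nothing is left.
emptied=$(printf '%s\n' "$line" | sed -E 's;^.*content="[^"]*".*>$;;')

printf 'problem 1: [%s]\n' "$no_match"
printf 'problem 2: [%s]\n' "$emptied"
```

The brackets in the printf format make the empty result of the second command visible.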

To fix this, you will need to return to the original idea of using a capture group, but use it to capture the relevant part of the line, and then replace the entire line with the content of the capture group, as in:

sed -E 's;^.*content="([^"]*)".*$;\1;'

This defines the text between the opening " that follows content= and the next " as the capture group, while the pattern as a whole still matches the entire line. The substitution then replaces the entire line with only the content of the capture group, referenced as \1.
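As a quick check, the capture-group command can be compared with the two-substitution command from the question. This is a sketch under the same assumptions as before: a sed that supports -E, illustrative variable names, and a shortened, ASCII-only sample line rather than the real page output:

```shell
#!/bin/sh
# Shortened, ASCII-only stand-in for the line that grep returns (assumption).
line='<meta name="description" content="Terreno en venta, Costa Esuri">'

# One substitution: match the whole line, keep only the captured value.
one_step=$(printf '%s\n' "$line" | sed -E 's;^.*content="([^"]*)".*$;\1;')

# The question's working version: strip the prefix, then strip the suffix.
two_step=$(printf '%s\n' "$line" | sed -E 's;^.*(content=\");;;s;\">$;;')

printf '%s\n' "$one_step"   # prints: Terreno en venta, Costa Esuri
printf '%s\n' "$two_step"   # prints: Terreno en venta, Costa Esuri
```

Both commands yield the same text here. The single substitution does not depend on the line ending in exactly ">, because the trailing .* absorbs whatever follows the closing quote.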
