
I am trying to scrape a website whose request headers include some attributes that are new to me, such as :authority, :method, :path, and :scheme.

{':authority':'xxxx',':method':'GET',':path':'/xxxx',':scheme':'https','accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','accept-encoding':'gzip, deflate, br','accept-language':'en-US,en;q=0.9','cache-control':'max-age=0','cookie':'GOOGLE_ABUSE_EXEMPTION=ID=0d5af55f1ada3f1e:TM=1533116294:C=r:IP=182.71.238.62-:S=APGng0u2o9IqL5wljH2o67S5Hp3hNcYIpw;1P_JAR=2018-8-1-9',   'upgrade-insecure-requests': '1',   'user-agent': 'Mozilla/5.0(WindowsNT6.1;Win64;x64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/68.0.3440.84Safari/537.36',   'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE=' }

I tried passing them as headers with an HTTP request, but got the error shown below.

ValueError: Invalid header name b':scheme'

Any help understanding these attributes, and guidance on how to use them when making a request, would be appreciated.

EDIT: code added

import requests

url = 'https://www.google.co.in/search?q=some+text'

headers = {':authority':'xxxx',':method':'GET',':path':'/xxxx',':scheme':'https','accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','accept-encoding':'gzip, deflate, br','accept-language':'en-US,en;q=0.9','cache-control':'max-age=0','upgrade-insecure-requests': '1',   'user-agent': 'Mozilla/5.0(WindowsNT6.1;Win64;x64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/68.0.3440.84Safari/537.36',   'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE=' }

response = requests.get(url, headers=headers)

print(response.text)
  • Please include your code, so that the error might get reproduced. Commented Aug 1, 2018 at 10:06
  • Header names are not supposed to contain colons, since colons are used as a delimiter in headers. Commented Aug 1, 2018 at 10:06
  • @LucaCappelletti Code added. Commented Aug 1, 2018 at 10:13
  • @blhsing Thanks for noticing. But I still did not get the proper response. Can you elaborate on those header attributes? Commented Aug 1, 2018 at 10:14

2 Answers

2

Your error comes from here (Python's source code)

HTTP header names cannot start with a colon, as the RFC states.
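As a minimal sketch of a workaround, assuming the headers were copied from the browser's developer tools as a dict, the colon-prefixed names can be filtered out before the dict is passed to requests. The helper name `strip_pseudo_headers` and the shortened header dict are made up for illustration; the actual network call is left as a comment.

```python
def strip_pseudo_headers(headers):
    """Return a copy of `headers` without names that start with ':'.

    `strip_pseudo_headers` is a hypothetical helper, not part of requests.
    """
    return {
        name: value
        for name, value in headers.items()
        if not name.startswith(":")
    }


# Shortened version of the dict from the question (values are placeholders).
copied = {
    ":authority": "xxxx",
    ":method": "GET",
    ":path": "/xxxx",
    ":scheme": "https",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "max-age=0",
}

headers = strip_pseudo_headers(copied)
print(headers)

# The cleaned dict can then be used as usual, for example:
# import requests
# response = requests.get("https://www.google.co.in/search?q=some+text", headers=headers)
```

This only avoids the ValueError; it does not guarantee that the server returns the same page a browser would see.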


5 Comments

Thanks for your answer. I have removed them and the code works fine now. But I still did not get the proper page response. Can you help with that?
Why do you need these headers?
I am trying to get the response of the webpage, so I am trying different methods. I found these headers strange and thought they might be the reason for not getting the response.
Maybe the page is rendered with JavaScript, so you need something like Selenium or github.com/miyakogi/pyppeteer
Sure. I will check these.
1

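:authority, :method, :path, :scheme are not regular HTTP headers. They are HTTP/2 pseudo-header fields, which carry the request line information (method, target URI) and which browser developer tools display alongside the real headers.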

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

':method':'GET'

defines the HTTP request method

https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods

and

:authority, :path, :scheme

are parts of URI https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
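To make the mapping concrete, here is a small sketch using the standard library's `urllib.parse.urlsplit` to break the URL from the question into the values a browser would show for these fields. The function name `pseudo_headers_for` is hypothetical, and the result is for inspection only; it is not meant to be passed to requests.

```python
from urllib.parse import urlsplit


def pseudo_headers_for(url, method="GET"):
    """Derive the HTTP/2 pseudo-header values from a URL and a method.

    `pseudo_headers_for` is a hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # In HTTP/2, :path carries the path plus the query string, if any.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return {
        ":method": method,
        ":scheme": parts.scheme,
        ":authority": parts.netloc,
        ":path": path,
    }


fields = pseudo_headers_for("https://www.google.co.in/search?q=some+text")
for name, value in fields.items():
    print(name, value)
```

In other words, the URL and the choice of `requests.get()` already supply all four values, so there is nothing extra to send.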

2 Comments

Do they play any role in getting a web page's response?
Yes, but you are already providing them elsewhere in your code. requests.get() represents the method, and url = 'google.co.in/search?q=some+text' is your URI (www.google.co.in is the authority, https is the scheme and /search is the path).
