Extract Data Between HTML Tags Using BeautifulSoup in Python
Solution 1:
Note that the single quotes only appear because you are asking the interactive interpreter to display a string value. You will find that
>>> print(soup.title.contents[0])
displays
" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "
and that is actually the contents of the title tag. You will observe that Beautiful Soup has converted the &quot; HTML entities into the required double-quote characters. To lose the quotes and adjacent spaces you can use
soup.title.contents[0][2:-2]
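The slice above assumes exactly one quote and one space at each end of the title. As a minimal sketch using only the standard library, the snippet below compares it with str.strip, which tolerates a varying number of quotes and spaces. Here html.unescape merely stands in for the entity decoding that Beautiful Soup performs while parsing, and the names raw_title, decoded, sliced and stripped are illustrative, not part of the original answer.

```python
import html

# The title text as it might appear in the page source, with &quot; entities.
raw_title = "&quot; CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN &quot;"

# Decode the entities; this mimics what the parser hands back as the title text.
decoded = html.unescape(raw_title)

# Fixed-width slice: drops exactly two characters from each end.
sliced = decoded[2:-2]

# strip() removes any run of double quotes and spaces from both ends.
stripped = decoded.strip('" ')

print(sliced)
print(stripped)
```

Both variables hold the same text for this title; strip is the safer choice if the amount of padding might differ between pages.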
The meta tag is a little trickier. I make the assumption that there is only one <meta> tag with an http-equiv attribute whose value is "refresh", so the retrieval returns a list of one element. You retrieve that element like so:
>>> meta = soup.findAll("meta", {"http-equiv": "refresh"})[0]
>>> meta
<meta content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" http-equiv="refresh"/>
Note, by the way, that meta isn't a string but a soup element:
>>> type(meta)
<class 'bs4.element.Tag'>
You can retrieve attributes of a soup element using indexing just like Python dicts, so you can get the value of the content attribute as follows:
>>> content = meta["content"]
>>> content
u'0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
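For readers who want to try the lookup outside the interpreter, here is a minimal, self-contained sketch. The HTML string is a made-up stand-in for the real page, and the built-in "html.parser" backend is an assumption made so that no extra parser needs to be installed. It uses find, which returns the first match or None, instead of indexing into the list returned by findAll, so a page without the tag does not raise an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real page is not reproduced here.
html_doc = """
<html><head>
<title>&quot; Example Title &quot;</title>
<noscript><meta http-equiv="refresh" content="0; URL=/notes/example/12345?_fb_noscript=1" /></noscript>
</head><body></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
meta = soup.find("meta", attrs={"http-equiv": "refresh"})

# Tag attributes are read with dict-style indexing, as described above.
content = meta["content"] if meta is not None else None

print(soup.title.string)
print(content)
```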
In order to extract the URL value you could just look for the first equals sign and take the rest of the string. I prefer to use a rather more disciplined approach, splitting at the semicolon and then splitting the right-hand element of that split on (only one) equals sign.
>>> url = content.split(";")[1].split("=", 1)[1]
>>> url
u'/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
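The two-step split above can be wrapped in a small helper that also copes with a content value that has no URL part. This is only a sketch: the function name refresh_url is hypothetical, and returning None for malformed input is a design choice of this example rather than anything the original answer prescribes.

```python
def refresh_url(content):
    """Return the URL from a meta refresh content value, or None if absent."""
    # Split off the delay ("0") from the rest ("URL=...") at the first semicolon.
    parts = content.split(";", 1)
    if len(parts) < 2:
        return None
    # Split on the first equals sign only, so "=" inside the URL survives.
    key, sep, value = parts[1].partition("=")
    if not sep or key.strip().lower() != "url":
        return None
    return value.strip()


content = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"
print(refresh_url(content))
```

Splitting on only the first equals sign matters here because the query string (_fb_noscript=1) contains one of its own.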
Solution 2:
To get a substring from the URL in the meta tag, you can use a regular expression. I think you can try this out:
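import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html: your HTML string
meta_url = soup.noscript.meta['content']
# Capture everything between the first "-/" and the last "?"
url = re.search(r'-/(.*)\?', meta_url).group(1)
print(url)
print(soup.title.text)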
Hope the above code solves your problem.
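To see what that regular expression actually captures, the sketch below applies it to the content value shown in Solution 1, using only the standard library. It returns the numeric note ID rather than the full URL. The stricter digits-only pattern is an alternative suggested here, not part of the original answer, and the variable names are illustrative.

```python
import re

content = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"

# Original pattern: text between the first "-/" and the last "?".
match = re.search(r"-/(.*)\?", content)
note_id = match.group(1) if match else None

# Alternative (an assumption about the URL shape): a run of digits
# between a slash and the question mark.
strict = re.search(r"/(\d+)\?", content)
strict_id = strict.group(1) if strict else None

print(note_id)
print(strict_id)
```

Checking the match object before calling group avoids an AttributeError when the pattern is not found.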