Extract Data Between HTML Tags Using BeautifulSoup in Python

Question

I want to extract the text inside the HTML 'title' tag. From the 'meta' tag, I want the URL value in its content attribute, and only the part of it before the '?'.

Note that the single quotes only appear because you are asking the interactive interpreter to display a string value. You will find that

>>> print(soup.title.contents[0])

displays

" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "

and that is actually the contents of the title tag. You will observe that Beautiful Soup has converted the &quot; HTML entities into the required double-quote characters. To lose the quotes and adjacent spaces you can use

soup.title.contents[0][2:-2]
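
The slice above assumes exactly one quote and one space at each end of the title. As a minimal sketch of a more tolerant alternative, str.strip can remove any run of quotes and spaces from both ends. The variable names here are illustrative, and the title string is assumed to be exactly what the session above prints:

```python
# Title text as printed in the session above (assumed to be exactly this string).
raw_title = '" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "'

# Fixed slice: drops exactly two characters from each end.
sliced = raw_title[2:-2]

# strip() with a character set removes any run of double quotes and spaces
# from both ends, however many there are.
stripped = raw_title.strip('" ')

print(sliced)
print(stripped)
```

Both approaches give the same result for this title. The strip version would also cope with a title that has extra or missing padding.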

The meta tag is a little trickier. I assume there is only one <meta> tag with an http-equiv attribute whose value is "refresh", so the retrieval returns a list of one element. You retrieve that element like so:

>>> meta = soup.findAll("meta", {"http-equiv": "refresh"})[0]
>>> meta
<meta content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" http-equiv="refresh"/>

Note, by the way, that meta isn't a string but a soup element:

>>> type(meta)
<class 'bs4.element.Tag'>

You can retrieve attributes of a soup element using indexing just like Python dicts, so you can get the value of the content attribute as follows:

>>> content = meta["content"]
>>> content
u'0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'

In order to extract the URL value you could just look for the first equals sign and take the rest of the string. I prefer to use a rather more disciplined approach, splitting at the semicolon and then splitting the right-hand element of that split on (only one) equals sign.

>>> url = content.split(";")[1].split("=", 1)[1]
>>> url
u'/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
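
The question also asks for only the part of the URL before the '?'. A minimal sketch of that last step, assuming the content string has the form shown in the session above. The names path and query are introduced here for illustration:

```python
# Value of the meta tag's content attribute, copied from the session above.
content = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"

# Split off the refresh delay, then split once on "=" to isolate the URL.
url = content.split(";")[1].split("=", 1)[1]

# partition() splits at the first "?" and returns (before, separator, after).
# If the URL contains no "?", the whole string is returned as the first element.
path, _, query = url.partition("?")

print(path)
print(query)
```

partition is used here rather than split("?")[0] only because it also hands back the query string. Either works for taking the text before the '?'.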
