I’m using regular expressions to parse a website’s source code and display a news headline in a Tkinter window. I have been told that parsing HTML with regex isn’t the best idea, but unfortunately I don’t have time to change approach now.
I can’t work out how to replace the HTML entity codes for special characters such as apostrophes with the characters themselves.
I currently have the following:
union_url = 'http://www.news.com.au/sport/rugby'
union_string = urlopen(union_url).read()
union_headline = re.findall('(?:sport/rugby/.*) >(.*)<', union_string)
union_headline_label = Label(union_window, text=union_headline,
                             font=('Times', 20, 'bold'), bg='White',
                             width=85, height=3, wraplength=500)
This doesn’t decode the HTML entities, so they show up in the label as raw codes. As an example, the headline prints as
“Larkham: Real worth of ‘Giteau’s Law’”
I have tried to find an answer without any luck. Any help is much appreciated.
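One possible approach, shown here only as a minimal sketch, is to pass each regex match through the standard library's entity decoder before handing it to the label. The sample markup below is hypothetical: it assumes the page encodes the quotes as numeric references such as &#8216; and &#8217;, which may not be what the real page uses. The helper name extract_headlines is also made up for illustration.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    # Python 2 fallback: HTMLParser provides an equivalent unescape method.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Hypothetical stand-in for the downloaded page source (assumed entity form).
# On Python 3, the real source would first need decoding, e.g.
# urlopen(union_url).read().decode('utf-8'), assuming the page is UTF-8.
sample_html = ('<a href="/sport/rugby/story" >Larkham: Real worth of '
               '&#8216;Giteau&#8217;s Law&#8217;</a>')


def extract_headlines(page_source):
    """Apply the pattern from the question, then decode HTML entities."""
    raw_matches = re.findall(r'(?:sport/rugby/.*) >(.*)<', page_source)
    return [unescape(match) for match in raw_matches]


headlines = extract_headlines(sample_html)
# re.findall returns a list, so pick a single string for display.
first_headline = headlines[0] if headlines else ''
```

With something like this in place, first_headline (a single string) could be passed as the text option of the Label instead of the whole list returned by re.findall.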