Trying to use Beautiful Soup (Python) to find 2 partial matches in an attribute’s value

(This is a follow-up question to a previous post, which user http://stackoverflow.com/users/771848/alecxe helped me with. Makes more sense to post this follow-up as an independent question though, so it is more searchable for others.)

I have a Python script that uses Beautiful Soup to locate some web reports on a hosting service.

Right now the script is pretty exacting, and I would like to make it a bit more flexible. I feel like reg-ex is what I need, but maybe some nested searches would work too. I’m open to suggestions.

My current code works like:

def search_table_for_report(table, report_name, report_type):
    # Search the table rows for the given report name, then grab the download URL for the given type.
    for row in table.findAll('tr')[1:]:
        # The [1:] slice makes the loop skip the first row, aka the headers.
        col = row.findAll('td')

        if report_name in col[0].string:
            print("----- parse out file type request url")
            report_type = report_type.upper()
            # this works, using exact match
            label = row.find("input", {"aria-label": "Select " + report_name + " I format " + report_type})
            # this doesn't work, using reg-ex
            #label = row.find("input", {"aria-label": re.compile("b" + report_name + ".*b" + report_type + ".*")})

            print("----- okay found the right checkbox, now grab the href link ----")
            link_url = label.find_next_sibling("a", href=True)["href"]
            return link_url

Which would search through a table like this:

<tr class="odd">
 <td header="c1">
  Report Download
 </td>
 <td header="c2">
  <input aria-label="Select Report I format PDF" id="documentChkBx0" name="documentChkBx" type="checkbox" value="5446"/>
  <a href="/a/document.html?key=5446">
   <img alt="Portable Document Format" src="http//stackoverflow.com/img/icons/icon_PDF.gif">
   </img>
  </a>
  <input aria-label="Select Report I format XLS" id="documentChkBx1" name="documentChkBx" type="checkbox" value="5447"/>
  <a href="/a/document.html?key=5447">
   <img alt="Excel Spreadsheet Format" src="http//stackoverflow.com/img/icons/icon_XLS.gif">
   </img>
  </a>
 </td>
 <td header="c4">
  04/27/2015
 </td>
 <td header="c5">
  05/26/2015
 </td>
 <td header="c6">
  05/26/2015 10:00AM EDT
 </td>
</tr>

I’d like to search the “aria-label” value for two values, or rather two partial matches within it. For example, instead of finding “Select Report format XLS”, I may sometimes need to find “Select Matrix format PDF”. I’m fairly sure the “Select” and “format” parts will always be there, but I can’t be certain, so only the second word and the final file type should be searched for, both as partial matches. The partial (rather than exact) matching is important because the “Report” word may be followed by words I don’t expect, as in “Select Report II format XLS”, which would fail an exact search for “Select Report format XLS”.
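To make the goal concrete, here is a rough sketch of the kind of matching I have in mind, run against plain strings only (no Beautiful Soup involved yet). The helper name build_label_pattern is made up for illustration, and the sketch assumes the name always appears before the type in the label.

```python
import re

def build_label_pattern(report_name, report_type):
    # Hypothetical helper: match any label that contains report_name as a
    # whole word, followed somewhere later by report_type as a whole word.
    # re.escape keeps characters in the inputs from being read as regex syntax.
    return re.compile(
        r"\b" + re.escape(report_name) + r"\b.*\b"
        + re.escape(report_type.upper()) + r"\b"
    )

pattern = build_label_pattern("Report", "xls")

labels = [
    "Select Report I format XLS",
    "Select Report II format XLS",
    "Select Report I format PDF",
    "Select Matrix format XLS",
]

# Keep only the labels that contain both the name and the type.
matches = [label for label in labels if pattern.search(label)]
print(matches)
```

Only the two "Report ... XLS" labels should survive the filter, whatever trailing words follow "Report".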

So I need code (regex, presumably) that will search for a given name (in place of Report) and a given type (in place of XLS). This is what I’ve tried, but it’s not working. I think the reg-ex syntax is good, but I suspect I’m jamming the re.compile in the wrong spot and using it in a way that Beautiful Soup does not expect.

label = row.find("input", {"aria-label": re.compile("b" + report_name + ".*b" + report_type + ".*")}) 
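For comparison, here is a minimal, self-contained sketch of how I understand a compiled pattern can be passed as an attribute value to find(). It is not confirmed as the fix. It assumes the bs4 package with the built-in "html.parser", uses a trimmed copy of the row above, and writes the word boundaries as raw-string r"\b" (in a normal string, "\b" is a backspace character and a bare "b" is just the letter b). The function name find_report_link is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample row, for illustration only.
html = """
<table><tr>
 <td>Report Download</td>
 <td>
  <input aria-label="Select Report I format PDF" type="checkbox" value="5446"/>
  <a href="/a/document.html?key=5446"></a>
  <input aria-label="Select Report I format XLS" type="checkbox" value="5447"/>
  <a href="/a/document.html?key=5447"></a>
 </td>
</tr></table>
"""

def find_report_link(row, report_name, report_type):
    # Hypothetical helper: partial match on both the name and the type.
    pattern = re.compile(
        r"\b" + re.escape(report_name) + r"\b.*\b"
        + re.escape(report_type.upper()) + r"\b"
    )
    # Beautiful Soup accepts a compiled regex as an attribute filter.
    label = row.find("input", {"aria-label": pattern})
    if label is None:
        return None
    link = label.find_next_sibling("a", href=True)
    return link["href"] if link is not None else None

row = BeautifulSoup(html, "html.parser").find("tr")
xls_url = find_report_link(row, "Report", "xls")
missing = find_report_link(row, "Matrix", "pdf")
print(xls_url)
print(missing)
```

Returning None when nothing matches is a choice made for this sketch only. My current function would raise an error at the find_next_sibling call instead.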

Hope I explained that well. Happy to clarify any confusion.

