Binary search (bisect) for a Unicode character in Python 2.7

I have a list, J, of 2048 Japanese Unicode words which ideally I’d like to index using a binary search, as discussed in this SO question (“Binary search (bisection) in Python). The binary_search function uses bisect_left from the bisect module instead of indexing the list with the list.index function.

import bisect
def binary_search(a, x, lo=0, hi=None): # can't use a to specify default for hi
    hi = hi if hi is not None else len(a)   # hi defaults to len(a)
    pos = bisect.bisect_left(a, x, lo, hi)  # find insertion point
    return (pos if pos != hi and a[pos] == x else -1) 

Now, let’s index each word in words:

words = "そつう れきだい ほんやく わかす りくつ ばいか ろせん やちん そつう れきだい ほんやく わかめ"
>>> words.split()
[u'u305du3064u3046', u'u308cu304du305fu3099u3044', u'u307bu3093u3084u304f', u'u308fu304bu3059', u'u308au304fu3064', u'u306fu3099u3044u304b', u'u308du305bu3093', u'u3084u3061u3093', u'u305du3064u3046', u'u308cu304du305fu3099u3044', u'u307bu3093u3084u304f', u'u308fu304bu3081']`)

Comparing binary_search with list.index:

>>> [ binary_search(J, x) for x in words.split()]
[-1, 2015, 1790, 2039, 1983, -1, 2031, 1919, -1, 2015, 1790, 2040]
>>> [ J.index(x) for x in words.split()]
[1019, 2015, 1790, 2039, 1983, 1533, 2031, 1919, 1019, 2015, 1790, 2040]

So for u'u305du3064u3046' (そつう), instead of an index of 1019, binary_search is returning an index of -1 (fail) since pos evaluates to 1027, u'u305du3068u304bu3099u308f'. Both words (index 1019 & 1027) begin with u'u305d', so that is where the issue seems to be.

How can binary_search be tweaked to find the index of a (Japanese) Unicode character?

Source: python

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.