Opened 9 years ago

Closed 7 years ago

#917 closed defect/bug (fixed)

Some address are not found due to unwritable chars

Reported by: simone bria Owned by: Simone Briatore
Priority: major Milestone: version 0.5.1
Component: port/android Version: git master
Severity: Keywords: address search non-english name, i18n
Cc: simone.bria@…, http://wiki.navit-project.org/index.php/user:mvglasow

Description

My configuration is the latest daily build for Android with OSM maps and IT locale. The problem is that i cannot search addresses in some city such as Istanbul due to some chars that cannot be written: for example in Istanbul the first letter, the I, is memorized into the map as the char U+130 (uppercase 'I' with a dot over it), as it is written in the turkish language. For others city, for example Mondovì, where i live in Italy, there is no problem because the "strange" char is at the end but for the Istanbul town there is no way to find any address in the city.

I think that it would be good that the search would do a search also of similar chars (example: when i write I on the search, navit searches all the chars with I: with accent, without accent, etc, to let all locales to be able to search every language.

Attachments (6)

navit_4692_ticket_917.patch (3.8 KB) - added by mvglasow (2) 9 years ago.
Complete character translation for letters in Latin-1 Supplement and Latin Extended-A
navit_4710_ticket_917.patch (3.7 KB) - added by mvglasow (2) 9 years ago.
binfile.c: Handle diacritics in town, district and all street name searches
ling_compare.diff (6.6 KB) - added by tryagain 9 years ago.
Implementation of idea above. Did not touch common search code, changed only POI search to use it.
ling_compare2.diff (11.9 KB) - added by tryagain 9 years ago.
Town, country, street, housenumber searches are switched to use new algorithm.
ling_compare3.diff (12.8 KB) - added by tryagain 9 years ago.
More comments.
ling_compare6.diff (13.7 KB) - added by tryagain 9 years ago.
Failed assertion expanding Cyrillic logatures is fixed.

Download all attachments as: .zip

Change History (24)

comment:1 Changed 9 years ago by mvglasow (2)

  • Cc http://wiki.navit-project.org/index.php/user:mvglasow added

As far as I know this generally works, but only for some characters:

  • n = ń (Polish)
  • s = š (some Slavic and Baltic languages)
  • and probably more, these are the ones I remember

It seems some characters are missing in the translation table. I do know that ī (Rīga) and ū (Jūrmala) will not be found with the "undecorated" variants (which was a real annoyance on my vacation).

Some Baltic characters which may need to be added to the list: ā, ē, ī, ū, ą, ę, į, ų, ė (of which ą and ę might work already).

The cleanest solution would be to go through the whole extended Latin character sets in Unicode and add the whole list rather than by try and error.

Last edited 9 years ago by mvglasow (2) (previous) (diff)

comment:2 Changed 9 years ago by mvglasow (2)

Edit: tried a few and found

  • e = ę (and supposedly also a = ą)
  • n = ń
  • s = š

but

  • e != ė
  • i != ī
  • u != ū

So we need to extend the translation table.

Changed 9 years ago by mvglasow (2)

Complete character translation for letters in Latin-1 Supplement and Latin Extended-A

comment:3 Changed 9 years ago by mvglasow (2)

I finally found the piece of code where this translation is done; there was quite a lot missing. A patch is attached. This contains A-Z equivalents (digraphs where I knew them) for all letters in the following Unicode ranges:

  • Latin-1 Supplement (Western Europe)
  • Latin Extended-A (many Central and East European languages)

As for Latin Extended B, C, D and Latin Extended Additional – I cannot tell which of these may actually occur in place names, so I didn't include any of them. However, for all parts of Europe which use the Latin alphabet, as well as for the Americas, this should work now.

Unfortunately, I don't have a working build environment at the moment, so I can't test it, but since I've just added extra members to an array, risk should be low. Maybe someone else can run some test cases against it, e.g.:

  • Riga should find Rīga (Latvia)
  • Siauliai should find Šiauliai (Lithuania)
  • Klaipeda should find Klaipėda (Lithuania)
  • Istanbul should find İstanbul (Turkey)
  • Forli should find Forlì (Italy)

and, if it works OK, apply the patch. Thanks!

comment:4 Changed 9 years ago by korrosa

  • Resolution set to fixed
  • Status changed from new to closed

Patch applied in http://navit.svn.sourceforge.net/navit/?rev=4706&view=rev. Thanks mvglasow!

If the problem persists, please re-open the ticket.

comment:5 Changed 9 years ago by mvglasow (2)

  • Resolution fixed deleted
  • Status changed from closed to reopened

This solved it only partly. For street names the patch seems to work nicely, i.e. when searching in Ventspils (Latvia), searching for "riga" will find Rīgas iela.

However, behavior for cities is unchanged. Searching in Lithuania for "klaip" will find Klaipėda, searching for "klaipe" will return an empty list.

My suspicion is that street search uses the code in linguistics.c while city search does not (there seems to be something similar, though, as Sventoji has always found Šventoji and similar). This will need to be fixed in order for the previous patch to work also for cities. Ideally, all search function should rely on the routines in linguistics.c and any code duplication for fuzzy matching should be avoided.

comment:6 Changed 9 years ago by mvglasow (2)

Suspicion confirmed...

The code in binfile.c, function binmap_search_get_item, performs this translation only for street names, not for city or district names. This needs to be fixed... I'll need to understand the code first, though.

comment:7 Changed 9 years ago by korrosa

Before you go off trying to understand the code, please first update the maps your are testing with: the change for town names is in the maps -> see http://irclogs.navit.ie/%23navit-2011-08-25.log at 1705hrs

If you're getting maps from the planet extractor, the change should have been integrated in the latest map, so download a fresh copy. If you generate your own, update your version of maptool and re-generate your map.

Let us know if this improves the situation....

comment:8 Changed 9 years ago by mvglasow (2)

too late, I think I have an idea of what's going on in the code already ;-)

On a serious note, I'll test with a fresh map (mine is about a month old), though I don't think it will change much... the linguistics.c magic causes Navit to effectively accept multiple spellings for a town name, e.g.

  • Groß-Schönhausen
  • Gross-Schonhausen
  • Gross-Schoenhausen

and possibly more. It's probably more efficient to do that expansion at run-time rather than store all admissible spellings in the map... but in any case I'll check before I start coding.

comment:9 Changed 9 years ago by mvglasow (2)

Just tested with build 4710 on WinCE and a fresh map of Latvia and Lithuania. No change in behavior with the new map: in town search, "klaip" will find Klaipėda while "klaipe" finds nothing and "riga" won't find Rīga. In street search, "riga" will find Rīgas iela.

The good news is that we're not wasting any space in the map by storing all kinds of adapted spellings for names :-) please don't even think about doing that in map data...

Anyways, I know where to change the code, just need to get my build env up and working again so I can test it...

Changed 9 years ago by mvglasow (2)

binfile.c: Handle diacritics in town, district and all street name searches

comment:10 Changed 9 years ago by mvglasow (2)

Patch attached. This one will handle diacritics in town and district searches, as well as in all street name searches. To be sure and consistent, I also included it for house number searches.

PS: my build env is up and running again and the patch is tested against it, using both my old map (obtained in July with Planet Extractor) and a fresh one (the one I used 3 hours earlier). The map doesn't make a difference, it works with either.

Last edited 9 years ago by mvglasow (2) (previous) (diff)

comment:11 Changed 9 years ago by tryagain

Current code seems to be unable to find "kārdināšanā" when search string is "kaardinaasanaa". But http://en.wikipedia.org/wiki/Latvian_language#Latvian_on_computers lists kaardinaasanaa as "internet style" spelling for kārdināšanā. Using special[i][1] for replacement when mode==2 and special[i][2]==NULL should solve this.

Next, do we expect somebody entering 'Standlerstrasse' to look for 'Ständlerstraße'? That boy is really looking for problems as he uses different letter presentation rules in one word. But maybe he knows about ß being ss but not about ä being ae.

To fix that case we can do 3n comparisons for each word, where n is the number of 'special' characters. That will definitely become too cpu-hungry especially if we plan to adopt this algorithm for Cyrillic letters romanization, when each letter will be 'special'.

I have another idea, to use recursion (or stack data structure to simulate it) in comparison algorithm. That will be far more efficient as we can prevent string fuzzing when first non-matching char is found. That seems to be even more efficient than current algorithm as we won't do any string fuzzing to compare words which begin with different letters. Will think of it.

Changed 9 years ago by tryagain

Implementation of idea above. Did not touch common search code, changed only POI search to use it.

Changed 9 years ago by tryagain

Town, country, street, housenumber searches are switched to use new algorithm.

comment:12 Changed 9 years ago by tryagain

Above patch should efficiently solve problems pointed in my comment 11 both for POI and contry/town/street/housenumber search.

I switched off using g_utf8_casefold because it's not available on all platforms (internal glib doesn't support it) and because it replaces ß with ss making it impossible to match "Straße" with "strase".

Also Cyrillic support is now there.

Changed 9 years ago by tryagain

More comments.

comment:13 Changed 9 years ago by mvglasow (2)

I agree - what I submitted was just a quick patch, tryagain's approach is probably cleaner. You have my blessing for it :-)

comment:14 Changed 9 years ago by zoff99

i compiled maptool with ling_compare3.diff added and got a lot of segfaults when creating maps did somebody acutally try this patch with a real osm mapfile?

europe does not build with this patch enabled

Changed 9 years ago by tryagain

Failed assertion expanding Cyrillic logatures is fixed.

comment:15 Changed 9 years ago by tryagain

This patch isn't in production yet because of need to clarify if it will be compatible with foregoing indexed search of places.

I'm sure it's compatible. I'll try to write here down some thoughts on doing binary search (do we mean something like that by indexed search term?).

First of all, there will be no need to store different names spelling in the index. We'll store the standard one as it present in OSM data. So call to linguistics_expand_special will go off giving us up to three times index size decrease (depending on language).

Here we go.

We'll need a data structure to store intermediate search results:

struct search_pvt  {
     index_item *low, *high;
};

... and a function to do one search step:

struct search_pvt* search_step(char *query, struct search_pvt *boundary);

Above function should dig into the index within boundaries until beginning of the string sitting in the middle of [low,high] interval matches the query string. If match is found then search boundaries are returned in new search_pvt structure. If no match found, NULL is returned.

Surronding search algorithm will try to expand each character starting from the beginning of the string and call search_step multiple times placing original and all expanded versions of current character to the the end of substring.

If search_step returns NULL, following characters are not expanded, algorithm goes one letter back to try another variant. If no variants are left, another step back is done, and so on. If we reach end of query string, then all search results of current branch are laying within high and low pointers and it's easy to get them all from the index now, after that we go one character back again.

By term "expand character" used above I mean looking up into the elements 1..2 of specials[i][] arrays to find what special characters specials[i][0] may have been converted to current non-special character. This lookup is to be done through hash, of course.

Far out of the scope of this ticket are questions about how index should be organized on disk, will it be loaded into the memory before doing the search or no and so on. Regardless the solutions chosen I feel suggested algorithm should be effective enough.

comment:16 Changed 7 years ago by usul

  • Keywords i18n added
  • Milestone set to version 0.5.1

Is this bug really only related to Android?

Over at OSM, we also provide some international namespaces, as name:de=* for example. This might be useful, to get the english name for example in arabic/asia countries, where strangers will completely fail to read the native language?

A full working address search is an essential feature -> scheduled to hotfix

comment:17 Changed 7 years ago by tryagain

Actually, we already have imported rules needed to remove Latin and Cyrillic accents. Istanbul is findable now.

Please feel free to reopen if there are some accented letters which you can't enter in unaccented form into the search field to get proper results.

Regarding different namespaces, I would like to keep it in another ticket and different milestone, see #1159.

comment:18 Changed 7 years ago by tryagain

  • Resolution set to fixed
  • Status changed from reopened to closed
Note: See TracTickets for help on using tickets.