Geocoder

class Location

class factfinder.src.geocoder.Location

Bases: object

This class geocodes addresses efficiently using Nominatim. Geocoded addresses are stored in the ‘book’ dictionary argument, so a repeated address is taken from the book instead of being geocoded again.
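The caching behaviour described above can be illustrated with a minimal sketch. The class `CachedLookup`, its `lookup` callable, the `calls` counter, and the stub `fake_geocoder` are hypothetical names used only for illustration; they are not part of `factfinder`, and the real class queries Nominatim rather than a stub.

```python
from typing import Callable, Dict, List, Optional

Coords = Optional[List[float]]


class CachedLookup:
    """Hypothetical stand-in for a geocoder that keeps an address book."""

    def __init__(self, lookup: Callable[[str], Coords]) -> None:
        self.lookup = lookup               # assumed: any callable that geocodes one address
        self.book: Dict[str, Coords] = {}  # address -> coordinates (or None)
        self.calls = 0                     # counts real lookups, for illustration only

    def query(self, address: str) -> Coords:
        # Only call the underlying geocoder for addresses not seen before.
        if address not in self.book:
            self.calls += 1
            self.book[address] = self.lookup(address)
        return self.book[address]


def fake_geocoder(address: str) -> Coords:
    # Stub that avoids network access; the coordinates are made up.
    return [59.93, 30.33] if "Nevsky" in address else None


geocoder = CachedLookup(fake_geocoder)
first = geocoder.query("Nevsky 1")
second = geocoder.query("Nevsky 1")  # served from the book, no second lookup
```

In this sketch failed lookups (`None`) are cached as well, which may or may not match the behaviour of the real class.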

max_tries = 3
geocode_with_retry(query: str) -> List[float] | None

Handles the 403 error that can occur while geocoding with Nominatim.

TODO:
1. Provide an option to use an alternative geocoder.
2. Wrap this function as a decorator.
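One possible shape for the retry logic, written as a decorator in the spirit of the second TODO item, is sketched below. The names `with_retry`, `ForbiddenError`, and `flaky_geocode` are hypothetical, and returning `None` after the last failed attempt is an assumption based on the `List[float] | None` return type, not a description of the actual implementation.

```python
import functools
import time
from typing import Callable, List, Optional


class ForbiddenError(Exception):
    """Hypothetical stand-in for the error raised on an HTTP 403 response."""


def with_retry(max_tries: int = 3, delay: float = 0.0) -> Callable:
    """Retry the wrapped function up to max_tries times on ForbiddenError."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except ForbiddenError:
                    if attempt < max_tries:
                        time.sleep(delay)  # pause before the next attempt
            return None  # assumed behaviour once every attempt has failed

        return wrapper

    return decorator


attempts = {"count": 0}


@with_retry(max_tries=3)
def flaky_geocode(query: str) -> Optional[List[float]]:
    # Stub: fails twice, then succeeds; the coordinates are made up.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ForbiddenError("403")
    return [59.93, 30.33]


result = flaky_geocode("Nevsky 1")
```

A decorator keeps the retry policy separate from the geocoding call, so the same wrapper could be reused if an alternative geocoder were added.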

query(address: str) -> List[float] | None

class Streets

class factfinder.src.geocoder.Streets

Bases: object

This class encapsulates functionality for retrieving street data for a specified city from OSM and processing it to extract useful information for geocoding purposes.

global_crs: int = 4326
static get_city_bounds(osm_city_name: str, osm_city_level: int) -> GeoDataFrame

Retrieves the boundary of the specified city from OSM using the Overpass API and returns it as a GeoDataFrame containing the boundary polygon.

static get_drive_graph(city_bounds: GeoDataFrame) -> MultiDiGraph

Method uses the OSMnx library to retrieve the street network for a specified city and returns it as a NetworkX MultiDiGraph object, where each edge represents a street segment and each node represents an intersection.

static graph_to_gdf(G_drive: MultiDiGraph) -> GeoDataFrame

Method converts the street network from a NetworkX MultiDiGraph object to a GeoDataFrame representing the edges (streets) with columns for street name, length, and geometry.

static get_street_names(gdf: GeoDataFrame)

Method extracts the unique street names from a GeoDataFrame of street segments.

static drop_words_from_name(x: str) -> str

Drops the parts of a street name that are not the name itself, such as the street type (e.g. avenue).
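The idea can be sketched as a simple word filter. The function `drop_type_words` and the word list `STREET_TYPE_WORDS` are hypothetical; the words actually dropped by `drop_words_from_name` are not listed in this documentation and may differ.

```python
# Hypothetical list of street-type words; the real list may differ.
STREET_TYPE_WORDS = {"avenue", "street", "lane", "boulevard", "embankment"}


def drop_type_words(name: str) -> str:
    """Remove street-type words, keeping the remaining words in order."""
    kept = [word for word in name.split() if word.lower() not in STREET_TYPE_WORDS]
    return " ".join(kept)


cleaned = drop_type_words("Nevsky Avenue")
```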

static clear_names(streets_df: DataFrame) -> DataFrame

This function pre-processes street names from OSM. This step is needed to match recognised street addresses later, because Nominatim is a very sensitive geocoder that requires an almost exact match between the address being geocoded and the addresses in the OSM database.

static run(osm_city_name: str, osm_city_level: int) -> DataFrame

class Geocoder

class factfinder.src.geocoder.Geocoder(model_path: str = 'Geor111y/flair-ner-addresses-extractor', device: str = 'cpu', osm_city_level: int = 5, osm_city_name: str = 'Санкт-Петербург')

Bases: object

This class provides simple geocoding functionality.

dir_path = '/home/docs/checkouts/readthedocs.org/user_builds/soika/checkouts/latest/factfinder/src'
global_crs: int = 4326
exceptions =
     Сокращенное наименование  ...  Код страны 2-х буквенный
0           ФОЛКЛЕНДСКИЕ О-ВА  ...                        FK
1                  МИКРОНЕЗИЯ  ...                        FM
2              ФАРЕРСКИЕ О-ВА  ...                        FO
3                     ФРАНЦИЯ  ...                        FR
4                       ГАБОН  ...                        GA
..                        ...  ...                       ...
264                    москва  ...                       NaN
265                       мск  ...                       NaN
266                       МСК  ...                       NaN
267                 Петербург  ...                       NaN
268                петербург   ...                       NaN

[269 rows x 4 columns]
extract_ner_street(text: str) -> Series

Calls the pre-trained custom NER model to extract addresses mentioned in Russian-language texts (usually comments) from social networks. The model scores 0.8 on F1 and other metrics.

get_ner_address_natasha(exceptions, text_col)
static get_stem(street_names_df: DataFrame) -> DataFrame

Finds the stem of a word so that the stem can be looked up in the street names dictionary (df).
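A stem-based lookup can be sketched as follows. The stemmer here is a deliberately naive suffix stripper with a hypothetical, incomplete list of endings (`ENDINGS`); it is an assumption for illustration only, and the stemmer actually used by `get_stem` is not specified in this documentation. The helper names `naive_stem` and `build_stem_index` are likewise hypothetical.

```python
from typing import Dict, List

# Hypothetical, incomplete list of Russian endings, longest first.
ENDINGS = ("ого", "ому", "ый", "ий", "ая", "ой", "ую", "ом", "а", "у", "е", "ы", "и")


def naive_stem(word: str) -> str:
    """Strip the first matching ending, keeping at least three letters."""
    word = word.lower()
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def build_stem_index(street_names: List[str]) -> Dict[str, List[str]]:
    """Map each stem to the street names that share it."""
    index: Dict[str, List[str]] = {}
    for name in street_names:
        index.setdefault(naive_stem(name), []).append(name)
    return index


index = build_stem_index(["Невский", "Московский", "Садовая"])
# A genitive form taken from a text is reduced to the same stem as the dictionary form.
matches = index.get(naive_stem("Невского"), [])
```

Several street names can share one stem, so a lookup like this returns a list of candidates rather than a single match.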

find_word_form(df: DataFrame, strts_df: DataFrame) -> DataFrame

In Russian, a word can take many different forms. Since addresses are extracted from texts in social networks, they may appear in any of these forms. This function matches such a free form to the one used in the OSM database.

Once the stem is found, several streets may have that stem in their name. However, the street name being searched for has its own specific ending (form), and not every matched street name can take it.

TODO: add a spellchecker, since words might be misspelled.

static get_level(row: Series) -> str

Addresses in the messages are recognized at different scales:

1. House level: the street name and house number are known.
2. Street level: only the street name is known (the centroid geometry of the street is used).
3. Global level: nothing is known except the city.
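The three levels described above can be sketched as a small classification function. The function name `address_level`, the keys `street` and `house_number`, and the returned labels are hypothetical; the real column names and level labels used by `get_level` may differ.

```python
from typing import Mapping, Optional


def address_level(row: Mapping[str, Optional[str]]) -> str:
    """Classify a record by how much of the address is known."""
    if row.get("street") and row.get("house_number"):
        return "house"   # street name and house number are known
    if row.get("street"):
        return "street"  # only the street name is known
    return "global"      # nothing is known except the city


levels = [
    address_level({"street": "Невский", "house_number": "1"}),
    address_level({"street": "Невский", "house_number": None}),
    address_level({"street": None, "house_number": None}),
]
```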

get_street(df: DataFrame, text_column: str) -> GeoDataFrame

Calls the NER model and post-processes the result to extract the address mentioned in the text.

create_gdf(df: DataFrame) -> GeoDataFrame

Creates a GeoDataFrame from the recognised geocoded geometries.

set_global_repr_point(gdf: GeoDataFrame) -> GeoDataFrame

Assigns the centroid (actually, the representative point) of the geocoded addresses to texts that were not geocoded (or did not contain any addresses according to the trained NER model).

merge_to_initial_df(gdf: GeoDataFrame, initial_df: DataFrame) -> GeoDataFrame

Merges the geocoded df with the initial df to keep all original attributes.

run(df: DataFrame, text_column: str = 'Текст комментария')