Identifying Irish addresses by county - Thomas Bibby

My friend Emmet called me with an interesting problem today. He had a spreadsheet with 28k rows. One of the columns was an address, sort of separated by commas, and it was very inconsistent. Some addresses ended in “Ireland”, others just had the county name; some used the form “County Limerick”, others “Co. Limerick”, and others still just “Limerick”. He needed to do some calculations by county, so he had to extract the county name for each row. As I learned doing the pubs research, Irish addresses are a pain!

I’d dealt with this sort of problem before (not with 28k rows mind!) when I used to work for national charities and it used to bug me terribly.

Emmet found a way to anonymise the data so he could send me a subset. I had a play around with a spreadsheet-based solution before breaking out the Python for a quick hack.

The script below reads in a file called “input.csv” and writes to a file called “output.csv” with the same data, but with the county name and a comma added to the start of each line (or Unknown if the script couldn’t work it out).

The script is case-insensitive, and matches the rightmost county on each line, so an address in Omeath, Co. Louth is correctly identified as being in Louth, and Dublin Rd., Athlone, Co. Westmeath matches Westmeath, not Dublin.
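To illustrate the rightmost-match idea on its own, here is a minimal sketch. It assumes a shortened, illustrative county list, and the helper name `rightmost_county` is made up for this example. Overlapping names such as Meath inside Westmeath need extra care and are deliberately left out of the list here.

```python
# Minimal sketch of rightmost matching, using a shortened, illustrative county list
COUNTIES = ['Dublin', 'Louth', 'Westmeath']

def rightmost_county(address):
    """Return the county whose last occurrence sits furthest right, or 'Unknown'."""
    upper_address = address.upper()
    best_county, best_index = 'Unknown', -1
    for county in COUNTIES:
        # rfind searches from the right and returns -1 when the name is absent
        index = upper_address.rfind(county.upper())
        if index > best_index:
            best_county, best_index = county, index
    return best_county

print(rightmost_county("Dublin Rd., Athlone, Co. Westmeath"))  # Westmeath
print(rightmost_county("omeath, co. louth"))                   # Louth
print(rightmost_county("somewhere else"))                      # Unknown
```

Starting `best_index` at -1 rather than 0 means a county name at the very start of a line still counts as a match.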

The script is fairly flexible about the structure of the input file: it treats each line as plain text rather than parsing the CSV columns, so the address data can be in different columns, or all in one.

Here’s the script:

###Usage: python parse-csv.py###
###Input file must be called input.csv###


#function definition: input_text is an iterable of lines, all_ireland is a bool
def prepend_address_with_county(input_text,all_ireland):
    #prepare the county list
    counties_list = ['Carlow','Cavan','Clare','Cork','Donegal','Dublin','Galway','Kerry','Kildare','Kilkenny','Laois','Leitrim','Limerick','Longford','Louth','Mayo','Meath','Monaghan','Offaly','Roscommon','Sligo','Tipperary','Waterford','Westmeath','Wexford','Wicklow']
    #add on the six counties if we want all-Ireland
    if all_ireland:
        counties_list.extend(['Antrim','Armagh','Derry','Down','Fermanagh','Tyrone'])
    output_lines = []
    errorcount = 0
    linecount = 0
    #loop over each line
    for line in input_text:
        linecount += 1
        #convert to uppercase once per line so matching is case-insensitive
        upper_line = line.upper()
        #keep track of the county we're going to feed in
        county_match = ''
        #track where the best match so far ends; -1 means nothing found yet,
        #so a county at the very start of a line (index 0) still counts
        best_end = -1
        #loop over all counties
        for county in counties_list:
            #look for the county, starting from the RHS
            find_index = upper_line.rfind(county.upper())
            #find_index will be -1 if we haven't found anything
            if find_index == -1:
                continue
            end_index = find_index + len(county)
            #keep the match that ends furthest right; on a tie keep the longer name,
            #so "Westmeath" beats the "Meath" inside it
            if end_index > best_end or (end_index == best_end and len(county) > len(county_match)):
                county_match = county
                best_end = end_index
        #have we got any matches?
        if county_match:
            output_lines.append(county_match+","+line)
        else:
            output_lines.append("Unknown,"+line)
            errorcount += 1
    return {"output":"".join(output_lines),"errors":errorcount,"lines":linecount}




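#read input.csv and work out the counties; "with" closes the file for us
with open("input.csv",'r') as input_csv:
    result_dict = prepend_address_with_county(input_csv,True)
#avoid dividing by zero on an empty file; 100.0 forces float division
if result_dict["lines"] > 0:
    percentage_error = 100.0*result_dict["errors"]/result_dict["lines"]
else:
    percentage_error = 0.0
print("%d lines processed, %d Unknown counties (%.2f%%)" % (result_dict["lines"], result_dict["errors"],percentage_error))
with open("output.csv",'w') as output_csv:
    output_csv.write(result_dict["output"])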
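
Because the script looks for county names as plain substrings, a name can also turn up inside a longer word: “Meath” appears inside both “Westmeath” and “Omeath”. One possible refinement, which is not part of the script above, is to match whole words only using a regular expression. The sketch below assumes a shortened, illustrative county list, and the names `COUNTY_PATTERN` and `rightmost_county_whole_word` are made up for this example.

```python
import re

# Shortened, illustrative county list; a real run would use the full list
COUNTIES = ['Dublin', 'Louth', 'Meath', 'Westmeath']

# \b anchors each name to word boundaries, so "Meath" cannot match inside "Westmeath"
COUNTY_PATTERN = re.compile(r'\b(' + '|'.join(COUNTIES) + r')\b', re.IGNORECASE)

def rightmost_county_whole_word(line):
    """Return the last whole-word county name in the line, or 'Unknown'."""
    matches = COUNTY_PATTERN.findall(line)
    if not matches:
        return 'Unknown'
    # findall returns the text as it appears in the line, so normalise the case
    return matches[-1].capitalize()

print(rightmost_county_whole_word("Dublin Rd., Athlone, Co. Westmeath"))  # Westmeath
print(rightmost_county_whole_word("Navan, Co. Meath"))                    # Meath
print(rightmost_county_whole_word("Omeath"))                              # Unknown
```

The trade-off is that whole-word matching misses county names run together with other text, so it is worth comparing the Unknown count of both approaches on real data.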
