Local Database Accuracy: How Good, How Bad?

There’s an ongoing debate about local database accuracy. Accuracy, freshness and comprehensiveness are the attributes you want in a local database. There are plenty of claims and complaints about the databases, but I haven’t seen anything empirical one way or the other.

Today I met with a startup in the local space that said it licensed the data of one of the big three database vendors and went “door to door” in one San Francisco Bay Area town to verify the accuracy of that information.

What did they find? Something was wrong (issues ranging from minor to major) with 47% of the listings! That’s a remarkably large number, and I was truly surprised.

If this is reflective of the general accuracy of local databases, then the problem is much larger than I previously thought.


30 Responses to “Local Database Accuracy: How Good, How Bad?”

  1. AhmedF Says:

    Oh yeah. Not surprising at all.

    Even more so, you’re talking about the SF Bay Area, which gets a ton of attention. Think of another large city that doesn’t have a buzzing tech industry. Imagine how bad the data is there.

  2. Mike Orren Says:

    We’re another startup in the local space that can more than vouch for the inaccuracy of those databases.

    Worse, even so-called locally monitored databases (daily newspaper, citysearch, judysbook et al.) aren’t much better.

    Or maybe that’s better… For those of us who think there’s a business in on-the-ground content.

  3. Greg Sterling Says:

    I think somebody who can create an alternative database does have a business.

  4. cohn Says:

    It’s because no one has figured out how to collect and then maintain this data in real time.

  5. AhmedF Says:

    Collection is not a problem – I would imagine it’s the prohibitive cost of upkeep.

  6. Malcolm Lewis Says:

    We will only get more accurate data when local businesses have enough incentive to proactively manage their own information – either directly or through a (paid) partner.

    I have long felt there is an opportunity for someone to create a single, central repository where businesses can publish their info once and once only, and all publishers interested in using the data can subscribe to it via a simple API.

    Someone with clout, e.g. Google, could provide real value to local businesses by providing such a service, and as a side effect would place themselves in a strong position to up-sell local advertising solutions (website, AdWords, etc.).

    I envision basic information (name, address, phone) being free to publishers, with enhanced info being available for a fee. Sharing that fee with local businesses might be an additional incentive for them to participate (above and beyond the primary benefit of centralized maintenance).
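
    To make that concrete, here is a minimal sketch of how a publisher might consume such a repository. Everything here – the endpoint, the field names and the API-key scheme – is a hypothetical illustration, since no such service exists:

        import json
        import urllib.request
        from urllib.parse import urlencode

        REGISTRY_URL = "https://registry.example.com/v1/businesses"  # hypothetical central repository

        def fetch_listings(city, state, api_key=None, enhanced=False):
            """Fetch basic (free) or enhanced (paid) listing data for one locale."""
            params = urlencode({"city": city, "state": state,
                                "enhanced": str(enhanced).lower()})
            request = urllib.request.Request(f"{REGISTRY_URL}?{params}")
            if api_key:  # enhanced fields (hours, menus, etc.) would require a paid key
                request.add_header("Authorization", "Bearer " + api_key)
            with urllib.request.urlopen(request) as response:
                return json.load(response)

        # Basic name/address/phone records would be free to any publisher:
        for biz in fetch_listings("Dallas", "TX"):
            print(biz["name"], biz["address"], biz["phone"])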

  7. Mike Orren Says:

    “We will only get more accurate data when local businesses have enough incentive to proactively manage their own information – either directly or through a (paid) partner.”

    This is the mistake that everyone makes – attributing rational behavior to a large pool of small businesses in a local market.

    I believe (and am admittedly drinking my own Kool-Aid) that a local newsgathering organization is best positioned to do this. As Ahmed says above, the collection is not as big a hurdle as the upkeep. And no matter the incentives, small businesses will not update their own data in sufficient mass to keep the database clean. It takes a hybrid of user-submitted error reports and an on-the-ground staff that both checks the data by phone and keeps track of venues closing/opening/etc.

    The newspapers seem to think this work is beneath their journalists. There is an opportunity there….

  8. Matthew Berk Says:

    Everyone wants more accurate data, to be sure, but we think of things in a slightly different way.

    There’s a very important difference between the accuracy of data and the fitness of a model built on top of that data. What we’re doing when we use this data is building a model of the world. And the fitness of that model is determined by a) the interpretive labor set upon the underlying data, and b) the framework in which that model is made useful to the consumer for a particular need.

    While we–like everyone else–place a premium on getting better data, we also know that the really hard work is to take what’s available, shore it up with as many other sources of data as we can, and produce useful models–and products–that are still highly “fit” to the needs of the consumer.

  9. Mike Orren Says:

    I get you, Matthew — but go here:

    http://www.openlist.com/all-browse.htm?vertical=restaurants&query=&loc=dallas%2C+tx&x=0&y=0

    The first page has three major mistakes. Two listings are out of business — one has been for several years. And the third is a landmark local restaurant, misspelled.

    Not a knock on y’all – there’s no way you’d know without people on the ground.

    On the whole, your listing is still better than anything available in Dallas (until we launch our restaurant listings next month).

    But as a local, these kinds of mistakes would turn me off the site. And as a traveler who goes to a restaurant that no longer exists?…

    Local is hard, sweaty, in-the-trenches work.

  10. Mike Orren Says:

    Interesting back-of-the-napkin calculation from our listings editors — To “touch” every restaurant in the DFW area twice-yearly to update data would (will?) take 150 phone calls/week.

    Our folks (building out the initial database) average 34 calls per man-hour.

    Will be interesting to see how that number changes on updates/verification.
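
    For the curious, here is the arithmetic those figures imply – a rough sketch; the DFW restaurant count below is inferred from the stated numbers, not something given above:

        # Back-of-the-napkin math behind the estimate above.
        calls_per_week = 150     # calls needed to touch every restaurant twice a year
        calls_per_manhour = 34   # observed rate while building the initial database
        checks_per_year = 2

        hours_per_week = calls_per_week / calls_per_manhour    # ~4.4 man-hours/week
        restaurants = calls_per_week * 52 / checks_per_year    # ~3,900 restaurants implied
        print(f"{hours_per_week:.1f} man-hours/week covering ~{restaurants:,.0f} restaurants")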

  11. Local Accuracy Comments Aplenty « Screenwerk Says:

    [...] My post yesterday on the accuracy of local databases has apparently struck something of a nerve. Check out the comments. [...]

  12. AhmedF Says:

    The algorithmic/crawling method has the same problem the search engines face right now – who do you trust, and can you *really* trust them?

    We have had incorrect results in our database. One person recently contacted me about a number we had wrong (the correct number was 531-xxxx and his phone # was 521-xxxx). The unfortunate person had been inundated with phone calls. It turned out it wasn’t just us – Google listed the incorrect number, as did other sources.

    So as I see it, an algorithmic solution is close to impossible. Every single database will have flaws. Even taking Mike’s comments – a twice-a-year update may sound like a lot, but is it really? What if restaurant X was ‘verified’ on January 1, and then went out of business 5 days later? For the next ~180 days the restaurant is listed in the database even though it no longer exists.

    An idea I kicked around was an automated phone system – once a month (so that the owner does not get peeved) the system dials xxx-xxx-xxxx and says something like “Hello. I am from XXX and this is an automated check that your place is still in business. If this is YYYYY, please press 1. Otherwise, press 0.”

    Voila! An ‘algorithmic’ solution to keep local addresses up to date. Of course, I myself usually hang up on such phone calls, but I am paranoid about privacy :)
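
    A rough sketch of that scheduling logic, for illustration – the telephony call itself is left as a stub, since it depends entirely on the provider, and the field names are assumptions:

        import datetime

        VERIFY_PROMPT = ("Hello. I am from {source} and this is an automated check "
                         "that your place is still in business. If this is {name}, "
                         "please press 1. Otherwise, press 0.")

        def place_ivr_call(phone, prompt):
            """Stub: dial `phone`, play `prompt`, return the digit pressed (or None)."""
            raise NotImplementedError("wire this up to a real telephony provider")

        def monthly_verification(listings, source="XXX"):
            """Call each listing at most once a month so that owners don't get peeved."""
            today = datetime.date.today()
            for biz in listings:
                last = biz.get("last_verified")
                if last and (today - last).days < 30:
                    continue  # verified within the last month; skip this round
                digit = place_ivr_call(biz["phone"],
                                       VERIFY_PROMPT.format(source=source, name=biz["name"]))
                if digit == "1":
                    biz["last_verified"] = today  # confirmed still open
                elif digit == "0":
                    biz["status"] = "closed?"     # flag for human follow-up
                # hang-ups / no answer leave the record untouched for a later retry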

  13. Mike Orren Says:

    Ahmed:

    Good points. A couple things:

    “What if restaurant X was ‘verified’ on January 1, and then went out of business 5 days later? For the next ~180 days the restaurant is listed in the database even though it no longer exists.”

    That’s why you need people on the ground, preferably with other things to do (like news gathering). The errors I pointed out on the other listing site weren’t caught by research. They were caught because I live here and am part of an organization that makes it its business to follow such comings and goings.

    In other words, the best you can hope for on local is “semi-scalable.”

    I like your phone idea, but it doesn’t account for our biggest problem on these calls– non-English speakers. That’s a hard one to beat. (Note that we are also collecting hours of operation.)

  14. Al Marshall Says:

    I will admit to being somewhat surprised by the 47% error rate Greg referred to. As we all know, database accuracy has always been a problem – think back to the errors you have experienced over the years with Directory Assistance and the print White Pages. But while in the past one could appreciate just how prohibitively expensive it would be to create accurate databases, I believe technology has dramatically reduced that cost in recent years, and at the same time the business opportunity has significantly increased.

    I believe we are several years past the point where a major search engine or IYP should have seen the business opportunity and begun making the investment. Instead, the IYPs have been content to use what is essentially the same database, purchased at a very low cost, and to “compete” on bells and whistles instead of focusing on what really matters – the content itself. I always felt that major publishers weren’t willing to make the multi-million-dollar investments needed to begin making a real difference in data quality, because it would likely take years to make a real impact on their business and there would be no way to predict what that impact would be. Personally, I feel that IYP sites have a very high churn rate among their user bases due to poor data quality, but because of tracking difficulties the publishers aren’t able to figure out what this number is, or the considerable benefits to be gained from even a small reduction in that churn.

    Imagine if a publisher were to use automated tools like AhmedF describes, in conjunction with web crawling and an overseas call center. The combination would create relatively low-cost verification and information-gathering capabilities (example: use the automated call to solicit businesses – “Press ‘1’ if you want a representative from Google to call you to collect information about your business that will be made available in our online directory”).

    The publishers, through their database of queries, already know what businesses are most requested and thus can concentrate limited resources on those businesses.

    With all the things that Google is doing, you’d think they’d get around to investing $10m/year, which would give them dramatically better data than any competing publisher. Arguably, Yahoo would be the best candidate to do this, since that company is more savvy about combining technology and people.

  15. AhmedF Says:

    I do have to wonder – what do they term as ‘minor’? The databases have a lot of info you wouldn’t share publicly (e.g. name of the owner) – was that included?

  16. Greg Says:

    It’s not for nothing that all those travel guide books pay college kids to speed through 10 countries in two months to fact-check, making sure that cool, cheap pension is still at 10 Rue Garcon. Quality local data comes from going out and getting it. Yahoo and Google both make it pretty easy for local mom-and-pops to update/correct their info and, especially outside the big cities, it seems that almost no one takes advantage of this.

  17. The (Un)reliability of Local Data « ConsensusBlog Says:

    [...] There’s an interesting post on the reliability (or lack thereof) of local data over on Greg Sterling’s blog that is very relevant to ConsensusBest and any other site dealing in any kind of “where” data. Helping people find the best products is just half the value we hope to provide. The other half is helping people find those products in their local stores, since most people still prefer to make most purchases live and in person. [...]

  18. No Posting Last Night: Blame Ingenio « Screenwerk Says:

    [...] Two other issues I’ve been thinking quite a lot about are: the problem of inaccurate local data and the problem of competing and divergent traffic metrics and what might be done about it. [...]

  19. James E. Johnson Says:

    From my perspective (a 66-year-old, officially retired yet quite active in various data-related projects):

    1. The most accurate databases being assembled today use the telephone as the “best authority” for sustaining accuracy (recency of information).

    2. Other techniques, including the comparison of data from multiple sources, can reduce the costs of producing the highest accuracy/recency, coupling quality with (good business) the lowest cost.

    3. The best solution is for local, vertical or other “common interest” alliances to share the costs of maintaining accurate (recently verified) data.

    4. As a retiree, I am more than somewhat available for conversation and consideration of the subject.

    – James E. Johnson, James@irmco.net, 817.881.4259

  20. Understanding Google Maps & Yahoo Local » Local Database accuracy Says:

    [...] Greg Sterling had an interesting piece on local database accuracy [...]

  21. earlpearl Says:

    I’ve optimized my biz site to do well for all sorts of long-tail local phrases. If the aggregators can’t do a decent job of providing good information, local businesses will do best by optimizing their own sites, and advertising agencies/local SEO experts will do well with basic search.

  22. Scrubbing Local Data - Tech Soapbox Says:

    [...] partook in a small but interesting discussion a while ago about how bad local data is out there. Not just bad, but also impossible to clean [...]

  23. Eric Says:

    Long-tail phrases are hard, though… I mean, I hear you, but I’m not sure I can wait that long.

  24. Bubba Says:

    It baffles me that this hasn’t been solved yet, and I stress yet. It’s very funny that all the proposed solutions were technology-based. That’s the main problem right there.

    The solution is so simple if you think about it. The problem is that it takes time, a value and a reason. I’m tired of these “local” sites that upload that generic listing and have an “Is this your listing?” button. What good is that? They basically upload information that everyone has and build a pretty interface that is basically useless.

    In my opinion, local search as it’s implemented today is a fraud, and so are the main companies in this area. Just look at Citysearch, which has to be the most useless website on the planet. But they’re “CitySearch” and the brand lives on. Look at their compete.com profile and it’ll tell the whole story: people will come, and they’ll leave much faster. Surprisingly, they are doing it “right” yet not taking advantage of it; instead they are devaluing it. And their search is a waste of time. It is implemented for them to make money, not for you to find what you are looking for. Try to find a restaurant in the East Village for brunch that has outdoor seating – a simple search made excruciatingly painful. Sorry to get off topic; I just hate Citysearch and what it represents in the local industry. I find it insulting.

    Anyhow, sorry to get long-winded. I know no one will probably see this post, given that it’s 2 years old, but that’s why I’m writing it.

    The solution is this, it needs to be built from SCRATCH, without data collection, the right way, on one system.

    Also, much love to Greg – without you I wouldn’t be doing what I’m doing, for the most part. Your blog has been enormously helpful.

    Coming Soon

  25. AhmedF Says:

    Let me know when you launch your product, Bubba – you sure seem to have it figured out so easily.

  26. Greg Sterling Says:

    I hope so. Will be interesting to see what you’ve built.

  27. Bubba Says:

    Oh hey guys, I have to say I’m a little embarrassed by my comment above. I was in a bad mood yesterday, and talking about these “local” search companies brings out the worst in me. I’ve been researching this market for 2 years now and it seems to be getting worse, which is fine by me. I wish I could comment more on Greg’s blog, but I need to keep things to myself, which is driving me nuts and is also quite lonely. I should say this, though: the only reason I know anything is from Greg and from people like you, Ahmed, who post replies.

    One last thing and I’ll go back underground where I belong. Ahmed, you said it seems like I figured things out so easily. This is true… However, anyone can talk a good game; the hard part will be implementing it. We all know that a change is definitely needed in this market, and change is never easy.

    This was fun, thanks, and hopefully we can talk about this one day soon. I’d really love that.

  28. HarvestInfo announces the launch of “IntelligentAds”. « Local Visibility Says:

    [...] Valley enriched business data could solve the “BAD DATA” issue surrounding a lot of local search [...]

  29. Anastasia Says:

    It amazes me that you have such a vast knowledge of this subject. Personally, I think this blog would be an eye-opener to most of its readers. This is really good work and I’m very impressed.


Comments are closed.

