Screen Scraping to the Rescue

I play this online game these days. It is a casual turn-based game. Nothing heavy. It is a bit addicting. Occasionally I like to compare my progress against other players. There is a ratings board on the web site. But here is the problem. Every so often, I get a message stating that I visited the ratings board too often. Then I cannot see my rank. WTF?

This is a free game. So it is not like I am losing money. But this should not be that hard. There are around 1000 total players. At any given time, only 10 of them are online. How hard can it be to support a rankings page? Yeah, they are probably querying a database and sorting by character level and experience.

What? Are they running an Excel database LOL? I bet it is MySQL. And hello? Can you cache the data please? Performance problem solved. No charge. Time to take matters into my own hands. I hear the source code for the site is available. No need for me to whine. I should just implement the cache idea and demonstrate an elegant solution.
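Here is roughly what I mean by caching, as a minimal sketch. I have not looked at the site's source yet, so nothing here is their code. `TTLCache` and the `loader` function are names I made up; `loader` stands in for whatever expensive query builds the rankings.

```python
import time


class TTLCache:
    """Hold one computed value and reuse it until it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader        # hypothetical: the expensive rankings query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._value = None
        self._expires_at = None      # None means "nothing cached yet"

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: run the query once and remember the result.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With around 1000 players and a rankings page that only needs to be fresh to the minute, something like `TTLCache(run_rankings_query, ttl_seconds=60)` would mean the database sees at most one rankings query per minute, no matter how often I hit refresh.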

My first step was to get a snapshot of all the ratings screens. Next I am going to code up a parser to grab the raw data out of the HTML. Then I think I will import this stuff into my own database. Might not even need to do the caching if I tune the SQL correctly. For now I may just use a free Oracle database. I could just as easily use MySQL. I think I already have an instance running on my machine right now.
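A rough sketch of the parse-and-import plan, using only the Python standard library. Big assumptions: I am pretending each ratings page is a plain HTML table with one `<tr>` per player and three `<td>` cells (name, level, experience), which may not match the real markup. I am also using SQLite as a stand-in for the Oracle or MySQL database, just because it ships with Python. The table and function names are mine.

```python
import sqlite3
from html.parser import HTMLParser


class RatingsParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:               # header rows made of <th> end up empty and are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_ratings(html):
    """Return a list of (name, level, experience) tuples from one saved page."""
    parser = RatingsParser()
    parser.feed(html)
    parser.close()
    return [(name, int(level), int(exp)) for name, level, exp in parser.rows]


def load_into_db(conn, players):
    """Create the (assumed) ratings table if needed and upsert the players."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings ("
        "name TEXT PRIMARY KEY, level INTEGER, experience INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO ratings VALUES (?, ?, ?)", players)
    conn.commit()


def rank_of(conn, name):
    """Rank is 1 plus the number of players strictly ahead on (level, experience)."""
    row = conn.execute(
        "SELECT level, experience FROM ratings WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    level, exp = row
    (ahead,) = conn.execute(
        "SELECT COUNT(*) FROM ratings "
        "WHERE level > ? OR (level = ? AND experience > ?)",
        (level, level, exp),
    ).fetchone()
    return ahead + 1
```

To try it, open a database with `sqlite3.connect("ratings.db")`, feed each saved page through `parse_ratings`, pass the result to `load_into_db`, and then ask `rank_of(conn, "my_character")`. With only about 1000 rows, the counting query should be cheap even without any tuning.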

This is going to be fun. In the end, I might even need to host the game on my own site. Pwned.

Errors Give Me the Log

I was trying to read an article or a blog entry. It took a while for me to get to that tab in my browser. When I did, I saw the following stack trace in the browser:

Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/_webapp25.py", line 714, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/s~combemmu-hrd/1.366712374553545399/main.py", line 432, in get
    if self.blog(self.request.path[1:]):
  File "/base/data/home/apps/s~combemmu-hrd/1.366712374553545399/main.py", line 230, in blog
    posts = [BlogPost.get_by_key_name('p' + slug)]
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 1275, in get_by_key_name
    return get(keys[0], **kwargs)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 1533, in get
    return get_async(keys, **kwargs).get_result()
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
    return self.__get_result_hook(self)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1450, in __get_hook
    self.check_rpc_success(rpc)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1222, in check_rpc_success
    rpc.check_success()
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 570, in check_success
    self.__rpc.CheckSuccess()
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
    raise self.exception
OverQuotaError: The API call datastore_v3.Get() required more quota than is available.


Yum. What is all this good stuff? The last line says the datastore call needed more quota than was available, so I guess too many people were viewing the blog entry. I got a good view of some of their directory structure. Obviously they are using Python, and also Google App Engine. I wonder if they meant for me to see all this?

More importantly, is there anything I can do with all this information? Maybe I just got some insight that could come in handy. I just tried to refresh the page again. It is still spitting out the stack trace. Something is hosed.
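For what it is worth, keeping a trace like that out of the browser does not take much. Below is a generic sketch, not their code and not the App Engine `webapp` API: a plain WSGI wrapper (the name `hide_tracebacks` is mine) that writes the full traceback to the server log and sends the visitor a bland error page instead.

```python
import logging
import sys

log = logging.getLogger("blog")


def hide_tracebacks(app):
    """Wrap a WSGI app so unhandled errors are logged, not shown to the visitor."""

    def wrapper(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            # The full traceback goes to the server log, where the site owner can read it.
            log.exception("unhandled error for %s", environ.get("PATH_INFO", "?"))
            # Passing exc_info is how WSGI lets an error handler replace the response.
            start_response(
                "500 Internal Server Error",
                [("Content-Type", "text/plain; charset=utf-8")],
                sys.exc_info(),
            )
            return [b"Something went wrong. Please try again later.\n"]

    return wrapper
```

This only catches errors raised while the wrapped app is being called, which is where the quota error in the trace above was raised. The visitor loses nothing useful, and I lose my free peek at the directory structure.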