Real Ultimate Programming

The Home for People Who Like to Flip Out and Write Code

Notes From PyATL 2011-07-14

Python for Data Mining

Ad-hoc analysis typically requires 3 layers:

  • Data Extraction (SQL or a query builder)
  • Transformation & Analysis (scripting language)
  • Presentation (Excel, PowerPoint, Access (I wonder what this is for?))

Python is a great fit for the middle layer. This is for all the usual reasons: succinct, expressive, std lib is big, PyPI is bigger, readable. There is also another one: easy access to high-speed options: JIT, Cython, Numpy, etc.

PROTIP: Take snapshots from time to time, because the data will change on you.

One reason Python is a win is that you can do the analysis on your desktop, while IT is touchy about giving people rights to run PL/SQL on the DB.

You do a lot of templating on the queries, because they’re so verbose and repetitive.
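The talk's own templates weren't captured in these notes, so here is a minimal sketch of the idea using string.Template from the standard library. The table name, column names, and the build_query helper are all made up for illustration.

```python
from string import Template

# Hypothetical query skeleton; only the structural parts are templated.
# The date values stay as bind parameters (:start, :end) for the DB driver.
QUERY = Template(
    "SELECT $columns FROM $table "
    "WHERE order_date >= :start AND order_date < :end"
)


def build_query(table, columns):
    """Fill in the table and column list for one variant of the query."""
    return QUERY.substitute(table=table, columns=", ".join(columns))


sql = build_query("orders", ["customer_id", "total"])
print(sql)
```

Templating only makes sense for identifiers you control; anything that comes from a user should go through bind parameters instead.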

The standard library will get you a lot farther than you think; you don’t always need to jump straight to Numpy.

YES! itertools FTW, baby.
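For a taste of what itertools buys you, here is a small sketch (not the speaker's example; the rows and the totals_by_region name are invented) that totals a column per group with nothing but the standard library.

```python
from itertools import groupby
from operator import itemgetter

# Made-up (region, amount) rows standing in for a query result.
rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3), ("north", 1)]


def totals_by_region(rows):
    """Sum the amounts per region; groupby needs its input sorted by the key."""
    ordered = sorted(rows, key=itemgetter(0))
    return {
        region: sum(amount for _, amount in group)
        for region, group in groupby(ordered, key=itemgetter(0))
    }


totals = totals_by_region(rows)
print(totals)
```

The sort is the part people forget: groupby only merges adjacent items with the same key.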

Wow, the first example really demonstrates how dense your code can get with list comprehensions and other standard library stuff. The dataset initialization was doing an awful lot in essentially one line.

Numpy

The classic Numpy array requires you to define your datatypes, and you only get one in a given array.

Structured array lets you name columns and have different datatypes for each column.
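To make the difference concrete, here is a short sketch with invented data. The field names (item, qty, price) are assumptions for illustration, not anything from the talk.

```python
import numpy as np

# A classic array: one dtype for every element.
prices = np.array([9.99, 14.5, 3.25], dtype=np.float64)

# A structured array: each named field carries its own dtype.
sales = np.array(
    [("widget", 3, 9.99), ("gadget", 1, 14.5), ("gizmo", 4, 3.25)],
    dtype=[("item", "U10"), ("qty", "i4"), ("price", "f8")],
)

# Fields are accessed by name and behave like ordinary arrays.
revenue = sales["qty"] * sales["price"]
best = sales[np.argmax(revenue)]["item"]
print(best, revenue.sum())
```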

Manipulating Data: Matplotlib

As the heading indicates, you can do more than just plot things with it; you can do some serious manipulation, too.

Rapid, Scalable Web Development with MongoDB, Ming, and Python

FossFor.us (the SourceForge black ops project to be more web 2.0) was built on CouchDB.

It didn’t scale the way they needed for SF.net; MongoDB came into play because of that.

TIL: Documents in MongoDB are limited to 4MB (now 16MB). SF.net had to rethink their initial design because of this fact.

Ming

They eventually decided they needed an “Object-Document Mapper” (hehe, they still call it an ORM): Enter Ming.

Ming allows them to define their schema and enforce it.

They also handle migrations with Ming, and they can be eager or lazy.
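I didn't write down how Ming spells this, so the following is a plain-Python sketch of the eager versus lazy distinction, not Ming's API. The version field, the name-splitting change, and every function name here are hypothetical, and a dict stands in for the collection.

```python
CURRENT_VERSION = 2


def migrate(doc):
    """Return a copy of the document upgraded to the current schema version."""
    doc = dict(doc)
    if doc.get("version", 1) == 1:
        # Hypothetical schema change: v1 had one "name", v2 splits it in two.
        first, _, last = doc.pop("name").partition(" ")
        doc.update(first_name=first, last_name=last, version=2)
    return doc


def migrate_eagerly(store):
    """Eager: rewrite every stored document up front."""
    for key in list(store):
        store[key] = migrate(store[key])


def load_lazily(store, key):
    """Lazy: upgrade a document only when it is read, then save it back."""
    doc = store[key]
    if doc.get("version", 1) < CURRENT_VERSION:
        doc = store[key] = migrate(doc)
    return doc


store = {"a": {"name": "Ada Lovelace"}, "b": {"name": "Alan Turing"}}
doc = load_lazily(store, "a")  # only "a" is upgraded; "b" is untouched
```

Lazy spreads the cost over normal traffic; eager pays it all at once but leaves no old-format documents lying around.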

They have the concept of a “unit of work”, which basically allows them to log all the updates against an object, then distill down into a single update (or close to it). This can be especially handy because you don’t have multi-statement transactions in MongoDB.
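Here is roughly how I picture the unit-of-work idea, as a self-contained sketch rather than Ming's actual implementation. The UnitOfWork class and the apply_update callback are invented; the only real MongoDB piece is the $set update operator.

```python
class UnitOfWork:
    """Collects field changes per document and collapses them into one update."""

    def __init__(self):
        self._pending = {}  # document id -> {field: latest value}

    def set(self, doc_id, field, value):
        """Record a change; later writes to the same field overwrite earlier ones."""
        self._pending.setdefault(doc_id, {})[field] = value

    def flush(self, apply_update):
        """Call apply_update(doc_id, {"$set": {...}}) once per changed document."""
        for doc_id, fields in self._pending.items():
            apply_update(doc_id, {"$set": dict(fields)})
        self._pending.clear()


calls = []
uow = UnitOfWork()
uow.set(1, "title", "Draft")
uow.set(1, "title", "Final")
uow.set(1, "status", "open")
# In real code the callback would issue the update against the collection.
uow.flush(lambda doc_id, update: calls.append((doc_id, update)))
print(calls)
```

Three logged changes turn into a single update for the document, which is the whole point when there are no multi-statement transactions to lean on.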

You can drop out of Ming if you need to, to get better performance.

Allura

SF.net is trying to give back to the Open Source community with Allura. It’s essentially their codebase.

Zarkov

Zarkov is an asynchronous TCP server for event logging with gevent, which they built on top of Ming.

Procedures, Objects, Reusability: httplib and its discontents

This is a deep-dive into how Python’s SimpleHTTPServer handles HTTP requests, which is a build-up to a rant about how the HTTP parsing code is not reusable because it’s attached to a class that is designed to be extended, not used as a utility module.
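To illustrate what "usable as a utility module" could look like, here is a deliberately simplified sketch of HTTP parsing as plain functions. This is not the standard library's code or Brandon's; the function names are invented, and it ignores HTTP/0.9 requests, folded headers, and repeated header names.

```python
def parse_request_line(line):
    """Parse a request line such as b'GET /index.html HTTP/1.1'."""
    parts = line.decode("iso-8859-1").rstrip("\r\n").split()
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % line)
    method, target, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError("bad HTTP version: %r" % version)
    return method, target, version


def parse_headers(lines):
    """Read header lines up to the first blank line into a lowercase-keyed dict."""
    headers = {}
    for raw in lines:
        text = raw.decode("iso-8859-1").rstrip("\r\n")
        if not text:
            break  # blank line marks the end of the headers
        name, _, value = text.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


request = parse_request_line(b"GET /index.html HTTP/1.1\r\n")
headers = parse_headers([b"Host: example.com\r\n", b"Accept: */*\r\n", b"\r\n"])
print(request, headers)
```

Because these are functions that take bytes and return values, anything can call them: a server, a test, a log analyzer. No subclassing required.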

OK, wow, Brandon is stubborn. He got medieval on the standard lib. Also, he is some kind of riled up about all the hoops he jumped through.

It’s entirely possible Brandon might not ever write another class again.

Back to flipping out…