Python for Data Mining
Ad-hoc analysis typically requires 3 layers:
- Data Extraction (SQL or a query builder)
- Transformation & Analysis (scripting language)
- Presentation (Excel, PowerPoint, Access (I wonder what this is for?))
Python is a great fit for the middle layer, for all the usual reasons: it is succinct, expressive, and readable; the standard library is big, and PyPI is bigger. There is one more reason: easy access to high-speed options such as JIT compilers, Cython, and NumPy.
PROTIP: Take snapshots from time to time, because the data will change on you.
One reason Python is a win is that you can do the analysis on your desktop; IT is touchy about giving people rights to run PL/SQL on the DB.
You do a lot of templating on the queries, because they’re so verbose and repetitive.
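A minimal sketch of what query templating could look like, using only `string.Template` from the standard library. The table and column names and the `build_query` helper are made up for illustration; the talk did not show this code.

```python
from string import Template

# Hypothetical query skeleton; table and column names are illustrative only.
MONTHLY_TOTALS = Template(
    "SELECT $key, SUM($measure) AS total\n"
    "FROM $table\n"
    "WHERE order_date >= :start AND order_date < :end\n"
    "GROUP BY $key"
)


def build_query(table, key, measure):
    """Fill in the structural parts of the query (identifiers only).

    Values such as dates stay as bind parameters (:start, :end) so the
    database driver, not string formatting, handles quoting.
    """
    for part in (table, key, measure):
        # Reject anything that is not a plain identifier before substituting.
        if not part.isidentifier():
            raise ValueError(f"not a plain identifier: {part!r}")
    return MONTHLY_TOTALS.substitute(table=table, key=key, measure=measure)


# One verbose query per table, generated from a single template.
queries = {
    name: build_query(name, "region", "amount")
    for name in ("orders_2010", "orders_2011")
}
```

Only identifiers are templated here; anything that comes from user input or data is better left to the driver's bind parameters.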
The standard library will get you a lot farther than you think; you don’t always need to jump straight to NumPy.
itertools FTW, baby.
Wow, the first example really demonstrates how dense your code can get
with list comprehensions and other standard library stuff. The
initialization was doing an awful lot in essentially one line.
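In the same spirit (this is not the speaker's example, just a sketch with made-up rows), here is a group-and-sum done with nothing but `itertools`, `operator`, and a comprehension:

```python
from itertools import groupby
from operator import itemgetter

# Made-up (region, amount) rows standing in for a query result.
rows = [
    ("east", 120),
    ("west", 75),
    ("east", 30),
    ("north", 50),
    ("west", 25),
]

# groupby only merges adjacent items, so sort by the grouping key first.
by_region = sorted(rows, key=itemgetter(0))

# One dict comprehension does the grouping and the aggregation.
totals = {
    region: sum(amount for _, amount in group)
    for region, group in groupby(by_region, key=itemgetter(0))
}
```

The sort before `groupby` is the part that is easy to forget: without it, each run of adjacent equal keys becomes its own group.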
The classic NumPy array has a single datatype, and every element in a given array shares it.
A structured array lets you name columns and give each column its own datatype.
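A small sketch of the difference, assuming NumPy is installed. The field names and values are invented for illustration.

```python
import numpy as np

# A classic array: one dtype for every element (the int is promoted to float).
plain = np.array([1, 2.5, 3])

# A structured array: named fields, each with its own dtype.
sales = np.array(
    [("east", 120, 9.99), ("west", 75, 4.50), ("east", 30, 9.99)],
    dtype=[("region", "U10"), ("units", "i4"), ("price", "f8")],
)

# Access a whole column by field name.
units = sales["units"]

# Boolean masks work per field, so filtering rows reads a lot like SQL.
east = sales[sales["region"] == "east"]
east_revenue = float((east["units"] * east["price"]).sum())
```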
Manipulating Data: Matplotlib
As the heading indicates, you can do more than just plot things with it; you can do some serious manipulation, too.
Rapid, Scalable Web Development with MongoDB, Ming, and Python
FossFor.us (the SourceForge black ops project to be more web 2.0) was built on CouchDB.
It didn’t scale the way they needed for SF.net; MongoDB came into play because of that.
TIL: Documents in MongoDB are limited to 4MB (now 16MB). SF.net had to rethink their initial design because of this fact.
They eventually decided they needed an “Object-Document Mapper” (hehe, they still call it an ORM): Enter Ming.
Ming allows them to define their schema and enforce it.
They also handle migrations with Ming, and they can be eager or lazy.
They have the concept of a “unit of work”, which basically allows them to log all the updates against an object, then distill down into a single update (or close to it). This can be especially handy because you don’t have multi-statement transactions in MongoDB.
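To make the idea concrete, here is a toy unit of work in plain Python. It does not use Ming's actual API; the `UnitOfWork` class and its `record` and `flush` methods are hypothetical names for illustration. The only MongoDB-specific piece is the `$set` update operator.

```python
class UnitOfWork:
    """Collects field changes per document and collapses them on flush.

    Hypothetical sketch, not Ming's implementation.
    """

    def __init__(self):
        # document id -> {field: latest value}
        self._pending = {}

    def record(self, doc_id, field, value):
        """Log a change; a later write to the same field overwrites the earlier one."""
        self._pending.setdefault(doc_id, {})[field] = value

    def flush(self):
        """Return one (filter, update) pair per touched document, then reset."""
        ops = [
            ({"_id": doc_id}, {"$set": dict(changes)})
            for doc_id, changes in self._pending.items()
        ]
        self._pending.clear()
        return ops


uow = UnitOfWork()
uow.record(1, "status", "open")
uow.record(1, "owner", "ann")
uow.record(1, "status", "closed")  # supersedes the first status write
ops = uow.flush()
```

Three recorded changes become a single update document for that object. Since a single-document update is atomic in MongoDB, batching this way gets you some of what a multi-statement transaction would otherwise provide.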
You can drop out of Ming if you need to, to get better performance.
SF.net is trying to give back to the Open Source community with Allura. It’s essentially their codebase.
Procedures, Objects, Reusability: httplib and its discontents
This is a deep-dive into how Python’s
SimpleHTTPServer handles HTTP requests, which
is a build-up to a rant about how the HTTP parsing code is not reusable
because it’s attached to a class that is designed to be extended, not
used as a utility module.
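For contrast, a sketch of what "HTTP parsing as a utility module" might look like: plain functions that take bytes and return data, with no handler class to subclass. This is not Brandon's code or anything in the standard library; `RequestLine`, `parse_request_line`, and `parse_headers` are hypothetical names, and the parsing is deliberately simplified.

```python
from typing import NamedTuple


class RequestLine(NamedTuple):
    method: str
    path: str
    version: str


def parse_request_line(line: bytes) -> RequestLine:
    """Parse a line such as b'GET /index.html HTTP/1.1\\r\\n'."""
    text = line.decode("iso-8859-1").rstrip("\r\n")
    parts = text.split()
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        raise ValueError(f"malformed request line: {text!r}")
    return RequestLine(*parts)


def parse_headers(lines) -> dict:
    """Parse header lines up to the first blank line into a lowercase-keyed dict."""
    headers = {}
    for raw in lines:
        text = raw.decode("iso-8859-1").rstrip("\r\n")
        if not text:
            break  # a blank line ends the header block
        name, sep, value = text.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {text!r}")
        headers[name.strip().lower()] = value.strip()
    return headers


request = parse_request_line(b"GET /index.html HTTP/1.1\r\n")
headers = parse_headers([b"Host: example.com\r\n", b"Accept: */*\r\n", b"\r\n"])
```

Because nothing here touches a socket or a server object, the same functions can be called from a server, a proxy, a test, or a log analyzer.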
OK, wow, Brandon is stubborn. He got medieval on the standard lib. Also, he is some kind of riled up about all the hoops he jumped through.
It’s entirely possible Brandon might not ever write another class again.
Back to flipping out…