Wednesday, November 14, 2012

When pip seems to misbehave...

I've just spent two hours or so working on a problem that should have been trivial...

Aside: I am recovering from the flu; I can conveniently heap the blame on something / someone else...


The good little boy

I try to be a good boy. I eat my spinach and carrots. I also develop in virtualenvs and see to it that my virtualenvs are not "polluted" by system packages. So I should not have any trouble with package management, right? Wrong.
The three chief virtues of a programmer are: Laziness, Impatience and Hubris. - Larry Wall, creator of the Perl programming language
I may not like Perl much, but I agree with Larry; laziness is a virtue. So, like any good lazy programmer, I write shell scripts that reduce the amount of typing that I have to do later on. Here's one trivial example ( located in an executable script called "shell", with the appropriate shebang at the top ):
    /usr/bin/ipython notebook -c "run ./shell.py"
I've taken to using the ipython notebook a lot ( even for my Django projects ), and I like it to be set up just so - with the Django environment already set up, and the correct packages already imported, so I don't have to trouble my little fingers with typing import statements when working in the notebook.

Trouble in paradise

This morning, I installed some new packages in my virtualenv. Standard, boring, regular procedure. Then I tried to run my shell / run the development server and got complaints that the packages could not be found. But they were there. Hell, I could even cd into my virtualenv's site-packages directory and see the bloody packages in there. My virtualenv was set up with --no-site-packages. Blistering barnacles.

So - what was the problem? Take a look at the shell "script" above - see any problem? Nope? Well, there is a big one; I have hard-coded a reference to the "system" ipython in /usr/bin/ipython; little wonder that I could not see the packages in my virtualenv.

Here's the default manage.py created by the Django utilities when you start a new project:

#!/usr/bin/env python
import os
import sys

if __name__ == "__main__":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "dummp.settings")

    from django.core.management import execute_from_command_line

    execute_from_command_line(sys.argv)

Look at the shebang at the top of the file. It will resolve to the "system" Python interpreter if run the lazy way ( ./manage.py <cmd> ), but work fine if run from the virtualenv with "python manage.py <cmd>".

Update: The shebang above appears to work in virtualenvs too.

Lesson learned

Make sure the shell scripts ( and shebangs at the top of Python scripts ) point to the correct interpreter when working in a virtualenv. Otherwise, grief will ensue.
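A quick way to check which interpreter a shell session will actually hand to your scripts ( nothing here is project-specific; it just inspects the PATH ):

```shell
# Which executables will this shell session actually run? Inside an
# activated virtualenv these should point into the virtualenv, not /usr/bin.
command -v ipython || true
PY=$(command -v python || command -v python3)
"$PY" -c "import sys; print(sys.prefix)"   # the virtualenv path when one is active, /usr otherwise
```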

The great mystery is this - this is not a new environment. I've been working happily in this virtualenv for weeks now. Why did it choose to bite me today? I blame the flu.

Implementing access control in a multi-user multi-organization system with Django

This week, I was working on an "access control" problem in a Django app that can be framed as follows:
  • there are many "organizations" accessing the system, including one "master" or "super" organization. Users assigned to each organization need to see only their own data, but the admins on the master need to see aggregate data
  • the usual Django roles and permissions need to be available for each organization, but operating within the context of the data that "belongs" to that organization
  • many of the models do not have a direct foreign key to the organization; although, in all cases, the organization can be inferred by "following" a series of foreign keys
On top of these functional requirements, I layered on my own requirements:
  • preferred solutions that would not break if a signal handler was detached ( or a trigger accidentally got disabled )
  • preferred solutions that did the filtering in the database
Initially, I thought I could use one of the "object permissions" libraries available for Django. I even wrote a blog post about it. But - I disqualified this approach on the following grounds:
  • each time a new user was added, I needed to have a signal handler that would retrospectively grant them permission to access objects belonging to their organization. This would make things a little brittle, and might have performance implications as the database size grew
  • they did not play well with roles and permissions. I wanted to have the standard Django permissions and roles layered on top, and used in the usual manner
My solution will be controversial to some people. I decided to write a custom manager to be shared by the models that needed this "pre-filtering". Here's what that looks like:
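The code embed from the original post has not survived here. As a rough stand-in ( every name below is my own invention; the real thing is a Django models.Manager whose get_queryset() filters on an organization id that a middleware class stashed in a threadlocal ), this is the shape of the idea in plain Python:

```python
import threading

# Request-scoped state; in the real app a middleware class sets this from
# request.user at the start of every request.
_state = threading.local()

def set_current_organization(org_id):
    _state.org_id = org_id

def get_current_organization():
    return getattr(_state, "org_id", None)

class OrganizationScopedCollection(object):
    """Plain-Python stand-in for the organization-filtering manager.

    In Django this would be a models.Manager whose get_queryset() appends
    something like .filter(department__branch__organization_id=org_id),
    "following" the foreign keys up to the organization.
    """

    def __init__(self, rows):
        self._rows = rows  # each row is a dict with an "org_id" key

    def all(self):
        org_id = get_current_organization()
        if org_id is None:
            # No organization in scope: the "master" org sees everything.
            return list(self._rows)
        return [row for row in self._rows if row["org_id"] == org_id]
```

Because every thread ( and hence every request, in a threaded worker setup ) gets its own _state, concurrent requests from different organizations do not see each other's filter.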

Controversies
This is what I expect to be controversial about this approach:

  • reliance on threadlocals ( populated by a middleware class ). These are always controversial
  • the manager needs to be updated / re-tested if the models / schema change
In defense
  • There are more than 10 tests devoted to exercising this manager; the tests are not shown in this blog post
  • I have some awareness of the deployment "gotchas" that threadlocals introduce, especially with regard to the choice of frontend server ( a threaded or multi-process "worker" based setup? )
The million-dollar question
Is there a good way in Django to "filter by organizational ownership" while still reserving the right to use standard roles and permissions in the app? If there is a better way, I am open to receiving some education...


Wednesday, November 7, 2012

Brevity vs Clarity

Today I had to work with some of my old code. It was originally written in a hurry ( lame excuse ). It sucked. Hard.

One of the functions I had written needed to walk a directory ( and its sub-directories ), pick out files whose file-names followed a specific pattern and <do something to them>. For the purposes of this post, the do something to the files bit is not relevant. This is what I had coded up originally ( to walk the directory and load the qualifying files ).

    def load_content(src_dir):
        for root, dirs, files in list(os.walk(src_dir))[:1]:
            for subdir in dirs:
                for subdir_root, subdir_dir, tip_files in os.walk(os.path.join(root, subdir)):
                    for tip_file in tip_files:
                        with open(os.path.join(subdir_root, tip_file)) as f:
                            content = yaml.load(f)
                            #  - rest of the code omitted 

What a monstrosity! I felt like ducking under the table to hide for a bit when I saw it this morning. Here's what I collapsed all that gobble-de-gook to:

    def load_tips(src_dir):
        filenames = [os.path.join(path, name)
                     for path, subdirs, files in os.walk(src_dir)
                     for name in files
                     if re.match(r'^.+\d+\.yml$', name)]
        for filename in filenames:
            with open(filename) as f:
                content = yaml.load(f)
                #  - rest of the code omitted 

It doesn't look that much shorter, but I got rid of a whole three levels of indentation. That made a big difference to the look - and readability - of the rest of the code.

My only worry is that the list comprehension in the revised version may confuse a less experienced Python programmer. The original "dumb" version is wordy, but straight-forward. The revised version is more compact, but not as "obvious". 

So, which of the two is "Pythonic"?

Supervisor - Reloading Code

This has tripped me up two different times.

Scenario
You have a Python web app, running nicely on a server, with supervisord keeping an eye on things. It's now time to deploy an update. Should be easy as pie, right?

I tend to push my deploy git branch to the server then run the app with gunicorn in situ - in the same directory, with the privileges of my shell user, with nginx proxying it to the outside world. supervisord takes care of starting and restarting it.

I expected that I could just push the update to the git repo, do "sudo supervisorctl restart <app name>" and walk off into the sunset, whistling a merry tune. I was wrong.

It turns out that when you make significant changes to the code or startup scripts, you need to:

  1. Stop supervisord. Completely.
  2. Start it.
I'm assuming that you've already carried out sanity checks on the app - it starts and runs OK when run "by hand" with gunicorn on the shell, any wrapper shell scripts that you have are tested and working fine etc.
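For reference, a minimal supervisord program section for this kind of setup ( every path and name below is made up for illustration ):

```ini
; Hypothetical program section. The gotcha: after significant changes to the
; code or to this file, "sudo supervisorctl restart myapp" is not enough -
; stop supervisord completely, then start it again.
[program:myapp]
command=/home/deploy/venvs/myapp/bin/gunicorn myapp.wsgi:application
directory=/home/deploy/apps/myapp
user=deploy
autostart=true
autorestart=true
```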

This is the sort of thing that is making me seriously consider automating my entire deployment workflow. 

Friday, November 2, 2012

Python, Django and Unicode Errors

So, today I had the "'ascii' codec can't encode character u'\xf8' in position..." error. Again.

As you'd expect, it's been discussed at length on StackOverflow. That is how I discovered this post. Adding sitecustomize.py to my virtualenv's lib directory worked. I should have been happy, right?

Not quite. This solution works in the dev environment, but does not lend itself to repeatable deploys in production. I presently use a supervisor + gunicorn + nginx setup. A little more research uncovered this blog post. I'm really hoping that this is the last I've seen of those Python unicode / encoding issues. A man can hope, can't he?
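For context, the error itself comes from Python 2 implicitly coercing a unicode string through the default ascii codec. A minimal illustration of the difference between implicit coercion and an explicit encode:

```python
# u'\xf8' is U+00F8 ( ø ), which ascii cannot represent.
name = u'Bj\xf8rn'

# Under Python 2, str(name) - or any implicit coercion to a byte string -
# is what raises UnicodeEncodeError; an explicit encode never involves ascii.
data = name.encode('utf-8')
assert data.decode('utf-8') == name
```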

Thursday, November 1, 2012

Audit Logging in a Django Application

I'm working on a Django application that needs to have ( for business reasons that are not relevant to this post ) "complete traceability of all actions". In other words, I need an audit trail. This is an issue that was discussed in the django dev list some years ago. The approach that came out of that discussion is available here, but it has lots of warnings and caveats. Time to look for other options.

Django packages tells me that there are 12 different packages that can help me do this. Oh, choices, choices...

django-reversion is the most active on github. django-audit-log is a distant second in terms of activity / popularity. The rest have far less activity - so I'll limit my choices to these two. django-audit-log is candid enough to admit on its home page that:
The audit log bootstraps itself on each POST, PUT or DELETE request. So it can only track changes to model instances when they are made via the web interface of your application. Note: issuing a delete in a PUT request will work without a problem (but don't do that). Saving model instances through the Django shell for instance won't reflect anything in the audit log. 
For this application, a lot of database "mutations" shall originate from background tasks. That rules out django-audit-log. django-reversion had better be good...

Going by its docs, it looks like it will be a fit for my project. Shall it be? Let's find out tomorrow morning.


Implementing Role Based Access Control in Django

One of the projects that I am working on now needs object level permissions, i.e. it is not enough to have an 'app_label.model_name.privilege' permission; you need to be authorized to invoke 'privilege' on a specific instance of that model.

The ( usually excellent ) Django docs have this to say on the matter: ( as at November 1 2012 )
Django's permission framework has a foundation for object permissions, though there is no implementation for it in the core. That means that checking for object permissions will always return False or an empty list (depending on the check performed). An authentication backend will receive the keyword parameters obj and user_obj for each object related authorization method and can return the object level permission as appropriate.
Mmmhh. They haven't said much there, have they?

The search for a solution

Stack Overflow is a good place to get "collective wisdom" on a problem. Most questions will have been asked, and answered. If it hasn't been asked / answered yet, you can usually get an answer in as few as five minutes. This particular question has been tackled before. The approach given in the accepted answer ( creating a custom back-end that knows how to resolve per-object permissions ) is easy to understand and to work on. In fact, that's what the Django guys had in mind when they wrote "An authentication backend will receive the keyword parameters obj and user_obj for each object related authorization method and can return the object level permission as appropriate." [ see full quote above ]
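As a sketch of that accepted-answer approach ( the class name and the 'owner' attribute are my own invention; the real thing would be listed in AUTHENTICATION_BACKENDS ):

```python
class ObjectPermissionBackend(object):
    """Hypothetical custom auth backend that answers per-object checks.

    Django calls has_perm(user, perm, obj=obj) on each backend listed in
    AUTHENTICATION_BACKENDS; returning False simply means this backend
    does not grant the permission.
    """

    def authenticate(self, username=None, password=None):
        # This backend never authenticates anyone; it only answers
        # permission checks.
        return None

    def has_perm(self, user_obj, perm, obj=None):
        if obj is None:
            # Model-level checks stay with the default ModelBackend.
            return False
        # Assumes the model exposes an 'owner' attribute to compare against.
        return getattr(obj, "owner", None) == user_obj
```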

I should be happy, now that I have a solution...but its always good to poke around Github and see what others have done before...

  • I find django-permission - but it has a bold-faced warning stating that it is in development and may not work in future. The last commit was 6 months ago. 
  • django-object-permissions is technically not on github but...it has a nice overview page on PyPi and saw a release a few months ago...However, the git repo looks like it has not been touched for 2 years...
  • django-guardian has lots of forks on github, saw activity 4 months ago and has nice docs. Could it be what the doctor ordered for me?
To be clear, I'm using "last commit date", "number of forks" etc as proxy measures of project activity / life. The "row level permissions" problem has limited scope - it is entirely possible that the projects have had low activity because they reached maturity months ago.

So - which strategy shall I choose? Let's wait for tomorrow morning, when I'll code something up.