A blog about different Django related development.

Friday, February 5, 2010

Django Full Text Search with Whoosh

The dd_search source can be browsed online or checked out from public Subversion:

svn co http://projects.django-development.com/svn_public/dd_devel/trunk/dd_search

This is a full text search engine based on Whoosh.


Overview


Built on Whoosh, the dd_search application adds full text searching to any Django project. It requires a small amount of effort when defining model classes, so that the developer can limit what is indexed and keep secure information from being queried by an anonymous user. It also allows dynamically created indexes, so you can split your indexes into security zones or simply by department. It supports searching multiple indexes and indexing a record into multiple indexes. The rest of this section goes into more detail.

Initially I was interested in making this implicit instead of explicit, but I quickly realized that was too risky and would open up too many security concerns. So, following the Pythonic mindset, I chose to make the indexing of models and columns explicit while minimizing the work required. The configuration consists of two entries per indexed model and a single declaration in your settings.py. At least my initial design is satisfied with this level of configuration.

The easiest way to handle secure searching is to define different indexes for different types of data. Your public data can be searchable by the world, so I suggest making a public index. If you also have internal or other higher-security groups, I suggest creating an additional index for each security area. I would never index critical columns such as credit card numbers, because that's one more vulnerable location to secure and not what this tool is intended for. You may also want to index by other factors such as data type, e.g. accounting or human resources. All of this is accomplished with a single entry when you declare your model class: dd_search_index = "index,index,index".

Not all fields should be indexed either: some will hold very sensitive information, while others may contain data so common that it dilutes the effectiveness of a search. Whatever your reasoning, you must choose the text-based columns to index. This is the second and last declaration required to start building indexes for a model. Again, it is a single entry when you declare your model class: dd_search_fields = "field,field,field".
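To make the two declarations concrete, here is a minimal sketch. A plain Python class stands in for a Django model so the snippet runs on its own; in a real project the same two attributes would sit on a models.Model subclass. The Article class, its field names, the index names, and the search_config helper are all hypothetical and not taken from the dd_search source.

```python
class Article(object):
    """Stand-in for a Django model (hypothetical; a real one subclasses models.Model)."""

    # Comma-separated names of the indexes this record should be written to.
    dd_search_index = "public,internal"
    # Comma-separated names of the text fields whose content gets indexed.
    dd_search_fields = "title,body"

    def __init__(self, title, body, internal_notes):
        self.title = title
        self.body = body
        self.internal_notes = internal_notes  # deliberately not listed above


def search_config(instance):
    """Split the two declarations into lists; empty lists mean 'do not index'."""
    def split(value):
        return [name.strip() for name in (value or "").split(",") if name.strip()]

    return (split(getattr(instance, "dd_search_index", "")),
            split(getattr(instance, "dd_search_fields", "")))
```

Leaving internal_notes out of dd_search_fields is the whole point of the explicit design: a field is only searchable if it is named.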

The other configuration requirement is to define a directory where your indexes will live. It needs to be writable by the account your web server runs as, which is another reason not to index really sensitive information. In your settings.py file define DD_SEARCH_INDEX_DIR="/full/path/to/directory".
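As a small sketch of this setting, the fragment below shows the declaration together with a quick writability check you could run as the web server account. The path is the placeholder from above, and index_dir_is_writable is a hypothetical helper, not part of dd_search.

```python
import os

# settings.py (fragment). Placeholder path: replace it with a real directory.
DD_SEARCH_INDEX_DIR = "/full/path/to/directory"


def index_dir_is_writable(path):
    """Return True if `path` is an existing directory the current account can write to.

    Hypothetical helper for a deployment check; not part of dd_search.
    """
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```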

You should also declare the get_absolute_url and __unicode__ methods for each model. Although not a requirement, it makes things work better, since these are what is used to generate the search results.
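A rough sketch of what these two methods might look like follows. Again a plain class stands in for a Django model so the snippet runs on its own. The URL pattern and the result_fields helper are assumptions made for illustration; in particular, mapping the unicode value to a title and the absolute URL to a url is my guess at how search results are built, not something taken from the dd_search source.

```python
class Article(object):
    """Stand-in for a Django model (hypothetical names throughout)."""

    def __init__(self, pk, title):
        self.pk = pk
        self.title = title

    def get_absolute_url(self):
        # Hypothetical URL scheme; a real project would usually build this with reverse().
        return "/articles/%d/" % self.pk

    def __unicode__(self):
        return self.title

    # On Python 3, Django uses __str__ instead of __unicode__.
    __str__ = __unicode__


def result_fields(instance):
    """Assumed mapping from a model instance to the title and URL of a search hit."""
    return {"title": str(instance), "url": instance.get_absolute_url()}
```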

Methods

update_index

You will not call this method yourself; it is called by the post_save signal. Every time a record is saved, it checks whether the record needs to be indexed. It returns as quickly as possible if no index is defined for the model, to minimize overhead. The method lives in the models.py file so that it is loaded when Django initializes, before any records are saved. It does the following:

1. Return right away if an index and fields are not defined for the instance.
2. Create the main index directory defined in settings if it's missing.
3. Concatenate the indexed fields into a long string for indexing.
4. Loop through indexes.
   1. If the index doesn't exist, create it.
   2. Update/add the index data.
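
The steps above can be sketched in plain Python as follows. This is not the dd_search implementation: the Whoosh calls are replaced by a hypothetical write_document callback so the control flow can be read and run without Whoosh installed, and the layout of one subdirectory per index is an assumption. In the real application the function would be connected to Django's post_save signal rather than called directly.

```python
import os


def _split(value):
    """Turn "a, b,c" into ["a", "b", "c"]; None or "" gives []."""
    return [name.strip() for name in (value or "").split(",") if name.strip()]


def update_index(instance, index_root, write_document):
    """Sketch of the post_save steps; returns the index directories written to.

    `write_document(index_dir, doc)` is a hypothetical callback standing in for
    the Whoosh work (open the index, update or add the document, commit).
    """
    index_names = _split(getattr(instance, "dd_search_index", ""))
    field_names = _split(getattr(instance, "dd_search_fields", ""))

    # 1. Return right away if an index and fields are not defined.
    if not index_names or not field_names:
        return []

    # 2. Create the main index directory if it's missing.
    if not os.path.isdir(index_root):
        os.makedirs(index_root)

    # 3. Concatenate the indexed fields into one long string.
    content = " ".join(str(getattr(instance, name, "")) for name in field_names)
    doc = {"pk": instance.pk, "content": content}

    written = []
    # 4. Loop through the indexes.
    for name in index_names:
        index_dir = os.path.join(index_root, name)  # assumed layout
        # 4.1 Create the index location if it doesn't exist.
        if not os.path.isdir(index_dir):
            os.makedirs(index_dir)
        # 4.2 Update or add the document.
        write_document(index_dir, doc)
        written.append(index_dir)
    return written
```

Checking the two declarations first keeps the cost close to zero for models that are not indexed, which matters because the handler runs on every save.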


dd_search


The dd_search method takes two parameters: a comma-separated list of indexes and the query. It returns the combined search results. It is intended to be called by your secured search routines, which manage which indexes can be searched by whom.
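
The idea of combining results from several indexes can be sketched like this. It is an illustration, not the dd_search source: the extra search_one parameter is a hypothetical callback standing in for the Whoosh query against a single index, and dropping hits with a URL that has already been seen is my assumption, made because a record can be indexed into more than one index.

```python
def dd_search(indexes, query, search_one):
    """Search each named index and return the combined hits.

    `indexes` is a comma-separated string of index names.
    `search_one(name, query)` is a hypothetical callback returning a list of
    hit dicts for one index. De-duplicating by URL is an assumption.
    """
    combined = []
    seen_urls = set()
    for name in indexes.split(","):
        name = name.strip()
        if not name:
            continue
        for hit in search_one(name, query):
            url = hit.get("url")
            if url is not None:
                if url in seen_urls:
                    continue  # same record already found in another index
                seen_urls.add(url)
            combined.append(hit)
    return combined
```

A secured wrapper would then choose the index list from the user's permissions, for example passing only "public" for an anonymous visitor.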

search_internal

I also provide a secured search_internal method as an example to get you started. It simply has the login_required decorator on it to verify that an authenticated user is searching.

Whoosh Schema

The schema stores the information needed to identify the specific Django record each entry is associated with. The content is a single concatenated string of the indexed columns, since which column contained the match is not relevant.


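# Imports this schema needs from Whoosh.
from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, STORED, TEXT, ID, KEYWORD

WHOSH_SCHEMA = Schema(
    app=STORED(),    # stored only, not searchable
    model=STORED(),  # stored only, not searchable
    pk=STORED(),     # stored only, not searchable
    title=TEXT(stored=True),           # searchable and returned with each hit
    url=ID(stored=True, unique=True),  # unique, so updates replace the old entry
    content=TEXT(analyzer=StemmingAnalyzer()),  # concatenated indexed columns
    tags=KEYWORD)    # not populated yet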


I have also added tags to the schema in case I add tagging to the indexing in the future, which seems like a logical next step.
