Integrate extracted PDF content with django-haystack

Question


I extracted the PDF/DOCX content using Solr, and I can run search queries against it with the following Solr URL:

http://localhost:8983/solr/select?q=Lycee

I would like to run such a query with django-haystack. I found this document that discusses the problem:

https://github.com/toastdriven/django-haystack/blob/master/docs/rich_content_extraction.rst

But there is no "FileIndex" class in django-haystack (2.0.0 beta). How can I integrate this kind of search into django-haystack?

+4

python solr django-haystack

Mohamed ali Dec 26 '12 at 6:20


1 answer

user3470130 · Answer 1 · 2014-07-21T22:59:56+0000

The "FileIndex" mentioned in the documentation is a hypothetical subclass of haystack.indexes.SearchIndex. Here is an example:

    from django.template import loader, Context
    from haystack import indexes
    from myapp.models import MyFile


    class FileIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True)
        title = indexes.CharField(model_attr='title')
        owner = indexes.CharField(model_attr='owner__name')

        def get_model(self):
            return MyFile

        def index_queryset(self, using=None):
            return self.get_model().objects.all()

        def prepare(self, obj):
            data = super(FileIndex, self).prepare(obj)

            # This could also be a regular Python open() call, a StringIO instance
            # or the result of opening a URL. Note that due to a library limitation
            # file_obj must have a .name attribute even if you need to set one
            # manually before calling extract_file_contents:
            file_obj = obj.the_file.open()
            extracted_data = self.backend.extract_file_contents(file_obj)

            # Now we'll finally perform the template processing to render the
            # text field with *all* of our metadata visible for templating.
            # (Newer Django versions expect a plain dict here instead of Context.)
            t = loader.select_template(('search/indexes/myapp/myfile_text.txt', ))
            data['text'] = t.render(Context({'object': obj,
                                             'extracted': extracted_data}))
            return data
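The comment about the .name attribute matters when the content is not a real file on disk. A minimal sketch, assuming the bytes are already in memory; as_named_file and the file name report.pdf are hypothetical names used only for illustration:

```python
import io


def as_named_file(data, name):
    """Wrap raw bytes in a file-like object that carries a .name attribute."""
    file_obj = io.BytesIO(data)
    # BytesIO has no .name by default, so set one manually.
    file_obj.name = name
    return file_obj


file_obj = as_named_file(b"%PDF-1.4 example bytes", "report.pdf")
print(file_obj.name)
```

An object built this way could presumably be passed to extract_file_contents in place of obj.the_file.open().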

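So the value assigned to extracted_data can come from whatever process you use to extract the PDF/DOCX content. You then update your template (search/indexes/myapp/myfile_text.txt in this example) to include that data, which is available as "extracted" in the template context.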
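If you would rather build the indexed text in Python than in a template, the extracted data could be flattened along these lines. This is only a sketch: it assumes the extraction result is a dict with a "contents" string and a "metadata" dict of lists, which is the shape the Haystack rich content documentation describes for the Solr backend. flatten_extracted is a hypothetical helper, not part of Haystack.

```python
import re


def flatten_extracted(extracted):
    """Return one plain-text string built from an extraction result."""
    if not extracted:
        # Assumption: extraction may yield nothing if it fails.
        return ""
    lines = []
    metadata = extracted.get("metadata") or {}
    for key in sorted(metadata):
        for value in metadata[key]:
            lines.append("%s: %s" % (key, value))
    # Crude tag removal; assumes the contents may contain markup.
    contents = re.sub(r"<[^>]+>", " ", extracted.get("contents") or "")
    # Collapse runs of whitespace left behind by the removed tags.
    lines.append(" ".join(contents.split()))
    return "\n".join(lines)


sample = {
    "contents": "<html><body><p>Lycee  Pierre</p></body></html>",
    "metadata": {"title": ["Report"], "author": ["M. Ali"]},
}
print(flatten_extracted(sample))
```

The returned string could then be assigned to data['text'] inside prepare(), or passed to the template as an extra context variable.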
