How to index PDF content with Lucene AdvancedDatabaseCrawler in Sitecore

Make sure the ProcessPdf method runs whenever the AddAllFields method is called.


        private void ProcessPdf(Document document, Item item)
        {
            // Only PDF media items (versioned or unversioned) are processed.
            if (item.TemplateID != Templates.PdfUnversionedTemplateId && item.TemplateID != Templates.PdfVersionedTemplateId) return;

            string pdfText = Util.StripTagsRegexCompiled(GetPdfText(item));

            if (string.IsNullOrEmpty(pdfText)) return;

            // Store the extracted text in the built-in content field so it becomes searchable.
            ProcessField(document, BuiltinFields.Content, pdfText, Field.Store.YES, Field.Index.TOKENIZED);
        }
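As a rough sketch of that wiring, the override below calls ProcessPdf after the base crawler has added its own fields. The class name PdfAwareCrawler is hypothetical, and deriving directly from Sitecore's DatabaseCrawler is an assumption. In a real project you would add the call inside the AddAllFields override of your own copy of the AdvancedDatabaseCrawler, which is also where ProcessField comes from.

```csharp
using Lucene.Net.Documents;
using Sitecore.Data.Items;
using Sitecore.Search.Crawlers;

// Hypothetical crawler class, used only to illustrate where the call goes.
public class PdfAwareCrawler : DatabaseCrawler
{
    protected override void AddAllFields(Document document, Item item, bool versionSpecific)
    {
        // Let the base crawler add its standard fields first.
        base.AddAllFields(document, item, versionSpecific);

        // Then add the extracted PDF text for PDF media items.
        ProcessPdf(document, item);
    }

    private void ProcessPdf(Document document, Item item)
    {
        // Body as shown in the ProcessPdf snippet in this post.
    }
}
```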

We need the template IDs for both the versioned and the unversioned PDF templates, since a PDF media item can be based on either of them.

        // Template ID of the versioned PDF media template.
        public static ID PdfVersionedTemplateId {
            get {
                return new ID("CC80011D-8EAE-4BFC-84F1-67ECD0223E9E");
            }
        }

        // Template ID of the unversioned PDF media template.
        public static ID PdfUnversionedTemplateId {
            get {
                return new ID("0603F166-35B8-469F-8123-E8D87BEDC171");
            }
        }

A helper class for stripping HTML tags from the extracted PDF text.

        public static class Util {
            // Compiled once and reused; matches anything that looks like an HTML tag.
            private static readonly Regex HtmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

            public static string StripTagsRegexCompiled(string source) {
                return HtmlRegex.Replace(source, string.Empty);
            }
        }
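To see what this pattern does in practice, here is a small standalone sketch. The class name StripTagsDemo, the null check and the sample strings are my own additions for illustration and are not part of the crawler code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Same pattern as in the Util class: a lazy match from '<' to the next '>'.
    private static readonly Regex HtmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripTags(string source)
    {
        // Treat null as empty so Regex.Replace does not throw.
        return HtmlRegex.Replace(source ?? string.Empty, string.Empty);
    }

    public static void Main()
    {
        // Ordinary markup is removed.
        Console.WriteLine(StripTags("<p>Annual <b>report</b> 2012</p>"));

        // Plain text containing '<' and '>' is treated as a tag too.
        Console.WriteLine(StripTags("if a < b then c > d"));
    }
}
```

The first call prints "Annual report 2012". The second one loses everything between the angle brackets, so keep in mind that PDF text containing literal "<" and ">" characters will be shortened by this helper. The pattern also does not match a tag that is split across two lines, because "." does not match a newline unless RegexOptions.Singleline is set.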

We need to call QueryParser.Escape on the extracted text to escape Lucene's special characters.

        private static string GetPdfText(MediaItem mediaItem) {
            if (mediaItem != null) {
                string pdfText = ParsePdf(mediaItem);
                pdfText = QueryParser.Escape(pdfText);
                return pdfText;
            }
            return "";
        }

Finally, the method that parses the PDF. It tries PDFBox first and falls back to iTextSharp if PDFBox fails.

        public static string ParsePdf(MediaItem mediaItem) {
            PDDocument pdDocument = null;
            InputStreamWrapper inputStreamWrapper = null;
            Stream stream = mediaItem.GetMediaStream();

            try {
                // First attempt: PDFBox.
                inputStreamWrapper = new InputStreamWrapper(stream);
                pdDocument = PDDocument.load(inputStreamWrapper);
                PDFTextStripper pdfTextStripper = new PDFTextStripper();
                string parsedPdf = pdfTextStripper.getText(pdDocument);
                Log.Info(mediaItem.Name + " was parsed by PDFBox.", typeof (Util));
                return parsedPdf;
            } catch (Exception ex) {
                Log.Error("Could not parse pdf: " + mediaItem.Name, ex, typeof (Util));
                Log.Info("Falling back to iTextSharp to index the pdf...", typeof (Util));

                // Second attempt: iTextSharp, reading from a fresh media stream.
                PdfReader pdfReader = null;

                try {
                    Stream pdfReaderStream = mediaItem.GetMediaStream();
                    pdfReader = new PdfReader(pdfReaderStream);

                    StringWriter output = new StringWriter();
                    // iTextSharp page numbers are 1-based.
                    for (int i = 1; i <= pdfReader.NumberOfPages; i++) {
                        output.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, i, new SimpleTextExtractionStrategy()));
                    }

                    Log.Info(mediaItem.Name + " was parsed by iTextSharp.", typeof (Util));
                    return output.ToString();
                } catch (Exception exception) {
                    Log.Error("Could not parse pdf with iTextSharp either: " + mediaItem.Name, exception, typeof (Util));
                    return String.Empty;
                } finally {
                    if (pdfReader != null) pdfReader.Close();
                }
            } finally {
                if (pdDocument != null) pdDocument.close();

                if (inputStreamWrapper != null) inputStreamWrapper.close();
            }
        }

I am using PDFBox and iTextSharp. If anyone knows of a better open-source .NET library for parsing PDFs, I would love to hear about it.

DLLs: PDFBox, iTextSharp


4 thoughts on “How to index PDF content with Lucene AdvancedDatabaseCrawler in Sitecore”

  1. This is great! The requirement has come up a few times, and I think we’ve talked our way out of it until now – this will be a good reference. Thanks

    1. Hello,

      Do you want to bring a table record in as a Lucene index record, or is there more data about the item that you want to include in the search?

      Please tell me more about the problem and I’ll see what I can do.

      Mortaza

  2. Thanks for the post. I used your example to implement a PDF file crawler for Sitecore 6.4, but only the file name is crawled, not the content. I created a new location to do the PDF crawling.

    $(1)
    /sitecore/media library/files
    true
    filesystem

    $(id)
    $(id)
