.net - Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET -


I've heard of Lucene.NET and I've heard of Apache Tika. The question is: how do I index these documents using C# rather than Java? I think the issue is that there is no .NET equivalent of Tika, which extracts the relevant text from these document types.

Update - Feb 05 2011

Based on the responses given, it seems there is not a native .NET equivalent of Tika. Two projects were mentioned, each interesting in its own right:

  1. Xapian project (http://xapian.org/) - an alternative to Lucene written in unmanaged code. The project claims to support "SWIG", which allows for C# bindings. Within the Xapian project there is an out-of-the-box search engine called Omega. Omega uses a variety of open source components to extract text from various document types.
  2. IKVM.NET (http://www.ikvm.net/) - allows Java to be run from .NET. An example of using IKVM to run Tika can be found here.

Given the above 2 projects, I see a couple of options. To extract the text, I could either a) use the same components that Omega is using or b) use IKVM to run Tika. To me, option b) seems cleaner, as there are only 2 dependencies.

The interesting part is that there are now several search engines that could be used from .NET: Xapian, Lucene.NET, or even Lucene (using IKVM).

Update - Feb 07 2011

Another answer came in recommending that I check out IFilters. As it turns out, this is what MS uses for Windows Search, so Office IFilters are readily available. There are also some PDF IFilters out there. The downside is that they are implemented in unmanaged code, so COM interop is necessary to use them. I found the below code snippet on a DotLucene.NET archive (no longer an active project):

    using System;
    using System.Diagnostics;
    using System.Runtime.InteropServices;
    using System.Text;

    namespace IFilter
    {
        // Flags passed to IFilter.Init to control how text is normalised and which properties are returned.
        [Flags]
        public enum IFILTER_INIT : uint
        {
            NONE = 0,
            CANON_PARAGRAPHS = 1,
            HARD_LINE_BREAKS = 2,
            CANON_HYPHENS = 4,
            CANON_SPACES = 8,
            APPLY_INDEX_ATTRIBUTES = 16,
            APPLY_CRAWL_ATTRIBUTES = 256,
            APPLY_OTHER_ATTRIBUTES = 32,
            INDEXING_ONLY = 64,
            SEARCH_LINKS = 128,
            FILTER_OWNED_VALUE_OK = 512
        }

        public enum CHUNK_BREAKTYPE
        {
            CHUNK_NO_BREAK = 0,
            CHUNK_EOW = 1,
            CHUNK_EOS = 2,
            CHUNK_EOP = 3,
            CHUNK_EOC = 4
        }

        [Flags]
        public enum CHUNKSTATE
        {
            CHUNK_TEXT = 0x1,
            CHUNK_VALUE = 0x2,
            CHUNK_FILTER_OWNED_VALUE = 0x4
        }

        [StructLayout(LayoutKind.Sequential)]
        public struct PROPSPEC
        {
            public uint ulKind;
            public uint propid;
            public IntPtr lpwstr;
        }

        [StructLayout(LayoutKind.Sequential)]
        public struct FULLPROPSPEC
        {
            public Guid guidPropSet;
            public PROPSPEC psProperty;
        }

        [StructLayout(LayoutKind.Sequential)]
        public struct STAT_CHUNK
        {
            public uint idChunk;
            [MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
            [MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
            public uint locale;
            [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
            public uint idChunkSource;
            public uint cwcStartSource;
            public uint cwcLenSource;
        }

        [StructLayout(LayoutKind.Sequential)]
        public struct FILTERREGION
        {
            public uint idChunk;
            public uint cwcStart;
            public uint cwcExtent;
        }

        // COM declaration of the IFilter interface.
        [ComImport]
        [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
        [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
        public interface IFilter
        {
            [PreserveSig]
            int Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, uint cAttributes, [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes, ref uint pdwFlags);

            [PreserveSig]
            int GetChunk(out STAT_CHUNK pStat);

            [PreserveSig]
            int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

            void GetValue(ref UIntPtr ppPropValue);
            void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
        }

        [ComImport]
        [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]
        public class CFilter
        {
        }

        public class IFilterConstants
        {
            public const uint PID_STG_DIRECTORY = 0x00000002;
            public const uint PID_STG_CLASSID = 0x00000003;
            public const uint PID_STG_STORAGETYPE = 0x00000004;
            public const uint PID_STG_VOLUME_ID = 0x00000005;
            public const uint PID_STG_PARENT_WORKID = 0x00000006;
            public const uint PID_STG_SECONDARYSTORE = 0x00000007;
            public const uint PID_STG_FILEINDEX = 0x00000008;
            public const uint PID_STG_LASTCHANGEUSN = 0x00000009;
            public const uint PID_STG_NAME = 0x0000000a;
            public const uint PID_STG_PATH = 0x0000000b;
            public const uint PID_STG_SIZE = 0x0000000c;
            public const uint PID_STG_ATTRIBUTES = 0x0000000d;
            public const uint PID_STG_WRITETIME = 0x0000000e;
            public const uint PID_STG_CREATETIME = 0x0000000f;
            public const uint PID_STG_ACCESSTIME = 0x00000010;
            public const uint PID_STG_CHANGETIME = 0x00000011;
            public const uint PID_STG_CONTENTS = 0x00000013;
            public const uint PID_STG_SHORTNAME = 0x00000014;
            public const int FILTER_E_END_OF_CHUNKS = (unchecked((int) 0x80041700));
            public const int FILTER_E_NO_MORE_TEXT = (unchecked((int) 0x80041701));
            public const int FILTER_E_NO_MORE_VALUES = (unchecked((int) 0x80041702));
            public const int FILTER_E_NO_TEXT = (unchecked((int) 0x80041705));
            public const int FILTER_E_NO_VALUES = (unchecked((int) 0x80041706));
            public const int FILTER_S_LAST_TEXT = (unchecked((int) 0x00041709));
        }

        /// <summary>IFilter return codes</summary>
        public enum IFilterReturnCodes : uint
        {
            /// <summary>Success</summary>
            S_OK = 0,
            /// <summary>The function was denied access to the filter file.</summary>
            E_ACCESSDENIED = 0x80070005,
            /// <summary>The function encountered an invalid handle, probably due to a low-memory situation.</summary>
            E_HANDLE = 0x80070006,
            /// <summary>The function received an invalid parameter.</summary>
            E_INVALIDARG = 0x80070057,
            /// <summary>Out of memory</summary>
            E_OUTOFMEMORY = 0x8007000E,
            /// <summary>Not implemented</summary>
            E_NOTIMPL = 0x80004001,
            /// <summary>Unknown error</summary>
            E_FAIL = 0x80004005,
            /// <summary>File not filtered due to password protection</summary>
            FILTER_E_PASSWORD = 0x8004170B,
            /// <summary>The document format is not recognised by the filter</summary>
            FILTER_E_UNKNOWNFORMAT = 0x8004170C,
            /// <summary>No text in current chunk</summary>
            FILTER_E_NO_TEXT = 0x80041705,
            /// <summary>No more chunks of text available in object</summary>
            FILTER_E_END_OF_CHUNKS = 0x80041700,
            /// <summary>No more text available in chunk</summary>
            FILTER_E_NO_MORE_TEXT = 0x80041701,
            /// <summary>No more property values available in chunk</summary>
            FILTER_E_NO_MORE_VALUES = 0x80041702,
            /// <summary>Unable to access object</summary>
            FILTER_E_ACCESS = 0x80041703,
            /// <summary>Moniker doesn't cover entire region</summary>
            FILTER_W_MONIKER_CLIPPED = 0x00041704,
            /// <summary>Unable to bind IFilter for embedded object</summary>
            FILTER_E_EMBEDDING_UNAVAILABLE = 0x80041707,
            /// <summary>Unable to bind IFilter for linked object</summary>
            FILTER_E_LINK_UNAVAILABLE = 0x80041708,
            /// <summary>This is the last text in the current chunk</summary>
            FILTER_S_LAST_TEXT = 0x00041709,
            /// <summary>This is the last value in the current chunk</summary>
            FILTER_S_LAST_VALUES = 0x0004170A
        }

        /// <summary>
        /// Convenience class which provides static methods to extract text from files using installed IFilters
        /// </summary>
        public class DefaultParser
        {
            public DefaultParser()
            {
            }

            // Native entry point in query.dll that locates and loads the IFilter registered for a file.
            [DllImport("query.dll", CharSet = CharSet.Unicode)]
            private extern static int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

            // Managed wrapper: returns null when no IFilter could be loaded for the file.
            private static IFilter loadIFilter(string filename)
            {
                object outer = null;
                IFilter filter = null;

                // Try to load the corresponding IFilter
                int resultLoad = LoadIFilter(filename, outer, ref filter);
                if (resultLoad != (int) IFilterReturnCodes.S_OK)
                {
                    return null;
                }
                return filter;
            }

            public static bool IsParseable(string filename)
            {
                return loadIFilter(filename) != null;
            }

            public static string Extract(string path)
            {
                StringBuilder sb = new StringBuilder();
                IFilter filter = null;

                try
                {
                    filter = loadIFilter(path);

                    if (filter == null)
                        return String.Empty;

                    uint i = 0;
                    STAT_CHUNK ps = new STAT_CHUNK();

                    IFILTER_INIT iflags =
                        IFILTER_INIT.CANON_HYPHENS |
                        IFILTER_INIT.CANON_PARAGRAPHS |
                        IFILTER_INIT.CANON_SPACES |
                        IFILTER_INIT.APPLY_CRAWL_ATTRIBUTES |
                        IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                        IFILTER_INIT.APPLY_OTHER_ATTRIBUTES |
                        IFILTER_INIT.HARD_LINE_BREAKS |
                        IFILTER_INIT.SEARCH_LINKS |
                        IFILTER_INIT.FILTER_OWNED_VALUE_OK;

                    if (filter.Init(iflags, 0, null, ref i) != (int) IFilterReturnCodes.S_OK)
                        throw new Exception("Problem initializing an IFilter for:\n" + path + " \n\n");

                    // Walk the chunks of the document; only text chunks are collected.
                    while (filter.GetChunk(out ps) == (int) (IFilterReturnCodes.S_OK))
                    {
                        if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
                        {
                            IFilterReturnCodes scode = 0;
                            while (scode == IFilterReturnCodes.S_OK || scode == IFilterReturnCodes.FILTER_S_LAST_TEXT)
                            {
                                uint pcwcBuffer = 65536;
                                StringBuilder sbBuffer = new StringBuilder((int) pcwcBuffer);

                                scode = (IFilterReturnCodes) filter.GetText(ref pcwcBuffer, sbBuffer);

                                if (pcwcBuffer > 0 && sbBuffer.Length > 0)
                                {
                                    if (sbBuffer.Length < pcwcBuffer) // Should never happen, but it happens!
                                        pcwcBuffer = (uint) sbBuffer.Length;

                                    sb.Append(sbBuffer.ToString(0, (int) pcwcBuffer));
                                    sb.Append(" "); // "\r\n"
                                }
                            }
                        }
                    }
                }
                finally
                {
                    // Release the COM object even when extraction fails part-way.
                    if (filter != null)
                    {
                        Marshal.ReleaseComObject(filter);
                        System.GC.Collect();
                        System.GC.WaitForPendingFinalizers();
                    }
                }

                return sb.ToString();
            }
        }
    }

At the moment, this seems to be the best way to extract text from documents using the .NET platform on a Windows server. Thanks for the help.

Update - Mar 08 2011

While I still think IFilters are a good way to go, I think that if you are looking to index documents using Lucene from .NET, a very good alternative is to use Solr. When I first started researching this topic, I had never heard of Solr. So, for those of you who have not either: Solr is a stand-alone search service, written in Java on top of Lucene. The idea is that you can fire up Solr on a firewalled machine and communicate with it via HTTP from your .NET application. Solr is written as a service and can do everything Lucene can do (including using Tika to extract text from .PDF, .XLS, .DOC, .PPT, etc.), and then some. Solr seems to have an active community as well, which is one thing I am not sure of with regard to Lucene.NET.
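To make the "communicate via HTTP" idea concrete, here is a minimal sketch of uploading a document to Solr from .NET so that Solr and Tika do the text extraction. It assumes Solr's extracting request handler is enabled at `/update/extract` in your Solr configuration. The class name `SolrExtractSketch`, its method names, and the base URL in the usage note are hypothetical, and the exact base path depends on your Solr version and core setup.

```csharp
using System;
using System.Net;

public static class SolrExtractSketch
{
    // Builds the URL for Solr's extracting request handler.
    // Assumption: the handler is registered at /update/extract under baseUrl.
    public static string BuildExtractUrl(string baseUrl, string documentId)
    {
        return baseUrl.TrimEnd('/') + "/update/extract"
            + "?literal.id=" + Uri.EscapeDataString(documentId)
            + "&commit=true";
    }

    // Posts the file as multipart/form-data; Solr extracts the text and indexes it.
    public static void IndexFile(string baseUrl, string documentId, string filePath)
    {
        using (var client = new WebClient())
        {
            client.UploadFile(BuildExtractUrl(baseUrl, documentId), filePath);
        }
    }
}
```

A call might look like `SolrExtractSketch.IndexFile("http://localhost:8983/solr", "report-1", @"C:\docs\report.pdf")`. Committing on every upload keeps the sketch short; for bulk indexing you would probably drop `commit=true` and commit once at the end.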

You can also check out IFilters - there are a number of resources if you search for ASP.NET IFilters.

Of course, there is added hassle if you are distributing this to client systems, because you will either need to include the IFilters with your distribution and install them with your app on their machine, or they will lack the ability to extract text from any files they don't have IFilters for.

