Saltlux releases "SearchBox" (서치박스), a shared-document search system

Saltlux has released "SearchBox," a search system for shared documents.
It is reportedly also going to be presented at the Saltlux seminar on April 27.

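SearchBox is built to search shared documents stored on external hard drives or file servers, and it ships as an integrated hardware appliance so that installation and management stay simple.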

Let's take a quick look at the basic features in a video.



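A few features that play to Saltlux's strengths stand out, and it shows that a lot of care went into the details.
Overall it looks like a well put together system. (Of course, I would have to actually use it to know for sure.)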

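Notable strengths
  1. The text mining features (clustering, summarization, extraction, similar-document search) look like a real strength.
  2. The integrated hardware appliance keeps cost, management, and installation simple.
  3. I am not sure how much use it will get, but it can search for objects inside documents such as "images," "graphs," and "tables."
  4. It offers document history and version management.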

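A few things to think about:

  1. The heart of shared-folder search is bound to be the ACL (document access permissions).
    According to the SearchBox feature description, however, user permissions are checked "when a document is downloaded."
    That means a user without access rights can still find every document at search time.
    Isn't that a rather serious security problem?
    I understand the technical difficulty, but this is not something to wave away just because it is hard.
  2. When "integrated search" is implemented, isn't the file server normally made searchable along with everything else?
    If so, a system that searches only the file server on its own seems likely to have limited market appeal.
  3. Couldn't shared-document search also be built with an ordinary desktop search package?
  4. Compared with company K, which has been selling a similar product in packaged form for about two years, isn't this market entry too late?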
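
To make point 1 concrete, here is a minimal sketch of what checking permissions at search time (rather than only at download time) could look like. The `Doc` class, the group-based ACL model, and the sample index are all made up for illustration; nothing here reflects how SearchBox actually works internally.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    path: str
    text: str
    # Hypothetical ACL model: the set of groups allowed to read the document.
    allowed_groups: set = field(default_factory=set)


def search(index, query, user_groups):
    """Return paths of matching documents that the user is allowed to read."""
    hits = []
    for doc in index:
        if query.lower() not in doc.text.lower():
            continue
        # ACL check at query time: documents the user may not read are
        # dropped here, so not even their titles reach the result list.
        if doc.allowed_groups & user_groups:
            hits.append(doc.path)
    return hits


# Made-up sample data for the sketch.
INDEX = [
    Doc("hr/salaries.xls", "2009 salary table", {"hr"}),
    Doc("pub/handbook.doc", "employee handbook with salary policy", {"hr", "staff"}),
]
```

With this data, a user in the `staff` group searching for "salary" only gets `pub/handbook.doc`, while a user in `hr` gets both documents. The hard part in practice is keeping the indexed ACLs in sync with the file server, which this sketch leaves out entirely.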

by 슈퍼맨 | 2009/04/21 16:55 | Enterprise Search | Trackback | Comments (2)

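Search engine developer groups

One of the sites most visited by people who work with search engines, and by students who want to try building one, is the search engine developer group (irgroup).
I am a member too, of course, and I get a lot of my information there.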

Is there a similar group overseas?

There is.

search_dev (Independent Search Engine Developers)

It differs from irgroup in two ways: it is a newsgroup rather than a website, and where irgroup focuses on developing search engines, search_dev focuses on making use of the major vendors' search engines.

The discussions are quite lively, so drop by if you are interested.

Link of interest: Japanese search in autonomy

by 슈퍼맨 | 2009/03/13 17:07 | Trackback | Comments (0)

Information Access Cross-Check - 2009

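This is CMS Watch's evaluation of search engine vendors for this year.
Compare it with the material Gartner published last fall and you will notice a few differences.

by 슈퍼맨 | 2009/03/04 15:29 | Enterprise Search | Trackback | Comments (0)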

Open Source Filter - Tika

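The documents a search engine indexes are, in general, plain text.
Documents stored in binary form, such as MS Word or PDF files, therefore pass through a module called a filter, which extracts their content as text.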

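Many search engines in Korea use the filter module from 사이냅소프트 under a recurring license contract.
It has delivered fairly stable performance over a long period in both portal search and enterprise search.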

Naturally, the open source world needs a module like this too, doesn't it?

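So Tika is being developed as a sub-project of Apache Lucene, and it is currently at version 0.2.
(Perhaps unsurprisingly) it does not support document formats such as 한글, 훈민정음, or 정음글로벌, but if you are using Lucene or Solr and only need to filter MS Office documents and PDFs, isn't it worth considering?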

Currently supported document formats
-------------

Microsoft's OLE 2 Compound Document format

A number of Microsoft applications, most notably the Microsoft Office suite, use the generic OLE 2 Compound Document format as the basis of their document formats. Tika uses Apache POI to support a number of these formats.

The OLE2 Compound Document format is designed for use with random access files, and so the input stream passed to a Tika parser needs to be spooled in memory or in a temporary file depending on the size of the document. See TIKA-153 for an effort to avoid this extra temporary file if the input document already comes from a file.

In addition to the shared base format, there is also a shared set of metadata in typical OLE2 documents. Tika uses the HPSF library from POI to parse these property sets and exposes them as the following document metadata:

  • TITLE Title
  • SUBJECT Subject
  • AUTHOR Author
  • KEYWORDS Keywords
  • COMMENTS Comments
  • TEMPLATE Template
  • LAST_SAVED Last Saved By
  • REVISION_NUMBER Revision Number
  • LAST_PRINTED Last Printed
  • LAST_SAVED Last Saved Time/Date
  • PAGE_COUNT Number of Pages
  • WORD_COUNT Number of Words
  • CHARACTER_COUNT Number of Characters
  • APPLICATION_NAME Name of Creating Application

Note that in practice the metadata in many documents is either missing, incomplete or even incorrect, so a client application should not rely too much on this information.

Support for the new Office Open XML format used by Microsoft Office 2007 is pending a POI upgrade. The current status is recorded in TIKA-152.

The generic OLE2 Compound Document format is automatically detected using a magic number, and further parsing can automatically determine the more specific document format. Tika also knows a number of common glob patterns like *.doc and *.ppt for these formats.
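
As a rough illustration of the detection idea described above, the sketch below checks the first eight bytes of a file against the OLE2 compound document signature and then falls back on the file name to guess the specific format. It is not Tika's API: the `detect` function and the `GLOBS` table are hypothetical, `*.xls` is my own addition (the text only names `*.doc` and `*.ppt`), and the `application/x-ole2-unknown` string is a placeholder rather than a registered media type. Tika itself refines the type by further parsing, which this sketch replaces with a simple name match.

```python
from fnmatch import fnmatch

# Signature found in the first 8 bytes of every OLE2 compound document.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical glob table for this sketch (not Tika's actual configuration).
GLOBS = {
    "*.doc": "application/msword",
    "*.ppt": "application/vnd.ms-powerpoint",
    "*.xls": "application/vnd.ms-excel",
}


def detect(header: bytes, name: str = "") -> str:
    """Guess a media type from the leading bytes and, if given, the file name."""
    if not header.startswith(OLE2_MAGIC):
        return "application/octet-stream"
    # The container is OLE2; use the file name to pick the specific format.
    for pattern, mime in GLOBS.items():
        if fnmatch(name.lower(), pattern):
            return mime
    return "application/x-ole2-unknown"  # placeholder label, see lead-in
```

Note that the magic number wins over the file name: a PDF renamed to `report.doc` is not reported as a Word document.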

The supported OLE 2 Compound Document formats are:

Microsoft Excel (application/vnd.ms-excel)
Excel spreadsheet support is available in all versions of Tika and is based on the HSSF library from POI.

The Excel parser in Tika uses the HSSF event API and is able to extract much of the document structure, including all (non-empty) worksheets and their table structures. Formula results are extracted as stored in the Excel file, and cell links are exposed as XHTML links. These features were added in Tika version 0.2.

Cell comments and formatting are currently not supported. See TIKA-148 and TIKA-103 for the respective issues.

See the ExcelParserTest test case for an example of parsing Microsoft Excel files.

Microsoft Word (application/msword)
Word document support is available in all versions of Tika and is based on the HWPF library from POI.

The Word parser uses the WordExtractor class from HWPF to extract document content as a sequence of paragraphs.

See the WordParserTest test case for an example of parsing Microsoft Word files.

Microsoft PowerPoint (application/vnd.ms-powerpoint)
PowerPoint presentation support is available in all versions of Tika and is based on the HSLF library from POI.

The PowerPoint parser uses the PowerPointExtractor class from HSLF to extract presentation content as a single paragraph.

See the PowerPointParserTest test case for an example of parsing Microsoft PowerPoint files.

Microsoft Visio (application/vnd.visio)
Visio diagram support was added in Tika version 0.2 and is based on the HDGF library from POI.

The Visio parser uses the VisioExtractor class from HDGF to extract diagram content as a sequence of paragraphs.

Microsoft Outlook (application/vnd.ms-outlook)
Outlook message support was added in Tika version 0.2 and is based on the HSMF library from POI.

The Outlook parser extracts the subject of the message and the From, To, Cc, and Bcc addresses (formatted for display) along with the body text of text/plain messages. The AUTHOR, TITLE and SUBJECT metadata properties are set explicitly, overriding potential generic document metadata retrieved from OLE2 property sets.

Compression formats

General purpose compression formats are used to reduce the size of any kind of document. Tika uses a parsing pipeline to support general purpose compression: in the first stage the compressed stream is decompressed, and the resulting stream is passed on to a second parsing stage, where it is processed as if the document had never been compressed.

Tika contains magic numbers and glob patterns for auto-detecting all supported compression formats. The glob patterns of compression formats are also used to determine the name of the original uncompressed document. If a client application has supplied a RESOURCE_NAME_KEY metadata property that matches such a glob pattern, then the decompressing first parsing stage will replace the RESOURCE_NAME_KEY metadata property with the deduced original document name before passing control to the second parsing stage.

Note that apart from the special handling of the RESOURCE_NAME_KEY property, no document metadata is passed to or from the second parsing stage. Only the text content extracted by the second stage parser is returned to the client application.
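
The two-stage idea can be sketched in a few lines. This is only an illustration using Python's standard `gzip` and `bz2` modules, not Tika code: `parse` and `parse_plain_text` are hypothetical names, and the second stage here does nothing more than decode UTF-8. The magic numbers are the standard leading bytes of gzip and bzip2 streams.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"       # first three bytes of every bzip2 stream


def parse_plain_text(data: bytes) -> str:
    """Stand-in for the second-stage parser; here it only decodes UTF-8."""
    return data.decode("utf-8")


def parse(data: bytes) -> str:
    """Stage 1: undo compression if detected. Stage 2: parse as usual."""
    if data.startswith(GZIP_MAGIC):
        data = gzip.decompress(data)
    elif data.startswith(BZIP2_MAGIC):
        data = bz2.decompress(data)
    # Only the extracted text comes back; no metadata crosses the stages.
    return parse_plain_text(data)
```

Calling `parse(gzip.compress(b"hello"))` and `parse(b"hello")` gives the same text, which is the whole point: the second stage never learns that the input was compressed.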

The supported compression formats are:

gzip compression (application/x-gzip)
Gzip support was added in Tika version 0.2 and is based on the GZIPInputStream class in the Java 5 class library.

The known gzip glob patterns are *.tgz, *.gz and *-gz, and they will respectively be replaced with *.tar, * and * as described above.

bzip2 compression (application/x-bzip)
Bzip2 support was added in Tika version 0.2 and is based on bzip2 parsing code from Apache Ant, which in turn was originally based on work by Keiron Liddle from Aftex Software.

The known bzip2 glob patterns are *.tbz, *.tbz2, *.bz and *.bz2, and they will respectively be replaced with *.tar, *.tar, * and * as described above.
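
The name replacement rules for gzip and bzip2 boil down to a small suffix table. The sketch below is my own reading of the patterns listed above, not Tika's implementation; `SUFFIX_RULES` and `original_name` are hypothetical names.

```python
# (suffix, replacement) pairs derived from the glob patterns in the text.
SUFFIX_RULES = [
    (".tgz", ".tar"), (".gz", ""), ("-gz", ""),                      # gzip
    (".tbz2", ".tar"), (".tbz", ".tar"), (".bz2", ""), (".bz", ""),  # bzip2
]


def original_name(name: str) -> str:
    """Deduce the name of the uncompressed document from a compressed name."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least one character before the suffix, like the glob '*.gz'
        # would for a meaningful base name.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)] + replacement
    return name  # no known compression suffix: leave the name untouched
```

For example, `backup.tgz` becomes `backup.tar` and `notes.txt.gz` becomes `notes.txt`, so the second parsing stage can pick a parser based on the deduced name.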

Other supported formats

Extensible Markup Language (application/xml)
Tika uses the javax.xml classes to parse Extensible Markup Language files. Support for Extensible Markup Language files was added in Tika 0.1.

HyperText Markup Language (text/html)
Tika uses the CyberNeko library to parse HyperText Markup Language files. Support for HyperText Markup Language files was added in Tika 0.1.

Images (image/*)
Tika uses the javax.imageio classes to extract Metadata from Image files. Support for Image files was added in Tika 0.2.

Java class files
The parsing of Java Class files is based on the asm library and work by Dave Brosius in JCR-1522. Support for Java Class files was added in Tika 0.2.

Java jar archives
The parsing of Java JAR archives is performed using a combination of the ZIP and Java class file parsers. Support for Java JAR archives was added in Tika 0.2.

MP3 Audio (audio/mp3)
The parsing of ID3v1 tags from MP3 files was added in Tika version 0.2. If found, the following metadata is extracted and set:
  • TITLE Title
  • SUBJECT Subject

The above information, as well as the Album, Track, Year, Genre and additional Comment, is extracted when set in the file.
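
For readers who have not looked at ID3v1 before, the tag is simply the last 128 bytes of the file in a fixed layout, so reading it takes very little code. The sketch below follows the ID3v1 and ID3v1.1 layout; `read_id3v1` is a hypothetical helper of my own, not the Tika parser, and decoding the fields as Latin-1 is an assumption.

```python
def read_id3v1(data: bytes):
    """Parse the 128-byte ID3v1 tag at the end of an MP3 file, if present."""
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None

    def text(raw: bytes) -> str:
        # Fields are fixed width and padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    info = {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
        "genre": tag[127],  # numeric genre index
    }
    # ID3v1.1: a zero byte at offset 125 means offset 126 holds the track number.
    if tag[125] == 0 and tag[126] != 0:
        info["track"] = tag[126]
    return info
```

The function returns `None` when the file is too short or does not end with a `TAG` block, which matches the "if found" wording above.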

OpenDocument (application/vnd.oasis.opendocument.*)
TODO

Plain text (text/plain)
Tika uses the International Components for Unicode Java library (ICU4J) to parse plain text. Support for plain text was added in Tika 0.1.

Extracting text content from plain text files is actually a relatively complex task due to the fact that the character encoding of the text file is often unknown to the parser.

The text parser in Tika uses the ICU4J CharsetDetector class to automatically detect the character encoding of any text input. As an added benefit, the ICU4J library is in some cases also able to detect the language in which the text is written.

The character encoding and language of the plain text document are returned as the Metadata.CONTENT_ENCODING and Metadata.LANGUAGE metadata properties. If the (declared) content encoding of a text document is already known to the client application, then it can be supplied as the Metadata.CONTENT_ENCODING metadata property to the parser to simplify encoding detection.
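
To show why a declared encoding simplifies things, here is a much cruder stand-in for encoding detection. It is explicitly not what ICU4J does (ICU4J uses statistical detection); this sketch just tries a declared encoding first, then a byte order mark, then UTF-8, then a legacy fallback. `decode_text` is a hypothetical helper, and choosing cp949 as the fallback is my assumption for Korean text.

```python
import codecs

# Byte order marks and the codec that consumes each of them.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(data: bytes, declared: str = None):
    """Return (text, encoding); `declared` plays the role of a known encoding."""
    candidates = []
    if declared:
        candidates.append(declared)       # trust the caller's hint first
    for bom, name in BOMS:
        if data.startswith(bom):
            candidates.append(name)       # a BOM is strong evidence
    candidates += ["utf-8", "cp949"]      # assumed fallbacks for this sketch
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue                      # wrong or unknown codec: try the next
    return data.decode("latin-1"), "latin-1"  # latin-1 never fails
```

A trial-decoding approach like this can easily guess wrong on short inputs, which is exactly why a real detector and an optional declared encoding are both useful.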

Portable Document Format (application/pdf)
Tika uses the PDFBox library to parse Portable Document Format (PDF) documents. Support for PDF was added in Tika 0.1.

Rich Text Format (application/rtf)
Tika uses Java's built-in Swing library to parse Rich Text Format (RTF) documents. Support for RTF was added in Tika 0.1.

The RTF parser in Tika uses the Swing RTFEditorKit class to extract all text from an RTF document as a single paragraph. Document metadata extraction is currently not supported.

tar archive (application/x-tar)
Tika uses an adapted version of the tar parsing code from Apache Ant to parse tar archives. The tar code is originally based on work by Timothy Gerard Endres. Support for tar archives was added in Tika 0.2.

ZIP archive (application/zip)
Tika uses Java's built-in Zip classes to parse ZIP files. Support for ZIP was added in Tika 0.2.
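
Archive parsing follows the same pattern as the compression formats: walk the entries and hand each one to another parser. A minimal sketch with Python's standard `zipfile` module is shown below. It is only an analogy to what the text describes for Java's built-in Zip classes; `extract_zip_text` and `make_zip` are hypothetical helpers, and the "parser" for each entry is just a UTF-8 decode.

```python
import io
import zipfile


def make_zip(entries: dict) -> bytes:
    """Build an in-memory ZIP from {name: bytes} (helper for trying the sketch)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def extract_zip_text(data: bytes) -> dict:
    """Map each entry name to its decoded text; non-text entries are skipped."""
    texts = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            raw = archive.read(info)
            try:
                texts[info.filename] = raw.decode("utf-8")
            except UnicodeDecodeError:
                pass  # a real pipeline would route this entry to another parser
    return texts
```

For instance, `extract_zip_text(make_zip({"a.txt": b"hello"}))` yields a dictionary with the single entry `a.txt`.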

by 슈퍼맨 | 2009/03/04 15:06 | SOLR | Trackback | Comments (0)
