Python Open Course - The Whoosh Full-Text Search Module Explained (3)

Preface

The previous chapter covered how to index documents with Whoosh. This chapter describes how to search documents that have already been indexed.

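1 Creating a Searcher object

A Searcher object is obtained from the Index object. It is most convenient to use it in a with statement, so that it is closed automatically: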

with ix.searcher() as searcher:
    ...

The lexicon(fieldname) method returns the terms that were indexed for the given field:

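import os

import whoosh.index as index
from whoosh.fields import *
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)

# Make sure the index directory exists before creating the index in it
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)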

writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", path=u"/b", content=u"The second one is even more interesting!")
writer.commit()

ix = index.open_dir("indexdir")
with ix.searcher() as searcher:
    print(list(searcher.lexicon("content")))

Output:

[b'added', b'document', b'even', b'first', b'interesting', b'more', b'one', b'second', b've']

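The most important method on the Searcher is, of course, search(), which runs a query and returns the matching results. The example below uses search_page(), a variant that returns a single page of results: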

from whoosh.qparser import QueryParser

qp = QueryParser("content", schema=schema)
q = qp.parse(u"hello world")

with ix.searcher() as s:
    # 20 results per page; return page 5
    results = s.search_page(q, 5, pagelen=20)

2 Ranking

By default, Whoosh ranks results with the BM25F algorithm. You can also specify a different scoring algorithm, for example TF/IDF:

from whoosh import scoring

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    ...

3 Keyword highlighting

You can call the highlights() method on each hit to highlight the matched keywords:

from whoosh.qparser import QueryParser

qp = QueryParser("title", schema=schema)
q = qp.parse(u"second")

with ix.searcher() as s:
    # 20 results per page; return page 1
    results = s.search_page(q, 1, pagelen=20)
    for hit in results:
        print(hit.highlights("title"))

Output:

<b class="match term0">Second</b> document

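4 Expanding with related keywords

You can call the more_like_this() method on a hit to extract key terms from one of its fields and use them to find other, similar documents: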

with ix.searcher() as s:
    # 20 results per page; return page 1
    results = s.search_page(q, 1, pagelen=20)
    for hit in results:
        more_results = hit.more_like_this("title")
        print(more_results)

5 Searching multiple fields

The keyword you are looking for may appear in the title, or it may appear in the content. In that case, use MultifieldParser:

from whoosh.qparser import MultifieldParser

mparser = MultifieldParser(["title", "content"], schema=schema)