Opened 12 years ago

Closed 12 years ago

#595 closed enhancement (fixed)

Allow omega to index Atom feed MIME type

Reported by: Mihai Bivol Owned by: Olly Betts
Priority: normal Milestone: 1.2.10
Component: Omega Version: SVN trunk
Severity: minor Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Omega doesn't currently index .atom files. I have attached a patch with a parser for atom files based on the currently existing HTML parser.

The top level title and author and are added to the document title and author. Inside entry tags, title is added to keywords.

The category tag is passed to keywords. The subtitle and summary tags are passed to dump.

Any other tags are ignored.

Attachments (3)

atomparse.diff (6.0 KB ) - added by Mihai Bivol 12 years ago.
Atom parser and allow Omega to index Atom files.
atomparse.2.diff (7.2 KB ) - added by Mihai Bivol 12 years ago.
atomparse.additional.patch (12.0 KB ) - added by Olly Betts 12 years ago.
additional changes applied

Download all attachments as: .zip

Change History (9)

by Mihai Bivol, 12 years ago

Attachment: atomparse.diff added

Atom parser and allow Omega to index Atom files.

comment:1 by Olly Betts, 12 years ago

Version: SVN trunk

I tried this out on my blog atom feeds, such as:

http://survex.com/~olly/blog/xapian/index.atom

The patch is definitely along the right lines, but I spotted some issues:

  • The "author" field includes the feed uri (from the <uri> tag inside <author>) which doesn't seem useful.
  • The handling of <category> tags seems wrong - the code tries to pull out any content, but it looks like the category is in a term attribute, e.g. <category term="foo" />
  • It seems to ignore <content> which I think probably should be treated as body text.
  • It looks like type=html on various tags needs special handling.
  • Also, it doesn't pick up any of the subtitle text in my blog, because our HTML parser doesn't handle CDATA, and just throws away the <...> it sees (this is a bug in our HTML parser really, rather than a bug in this patch):
<subtitle type="html"><![CDATA[
A blog lacking a good description.
]]></subtitle>

It'd be good to update the documentation to reflect this newly supported type too.

comment:2 by Olly Betts, 12 years ago

OK, I added support for CDATA to the HTML parser in r16490.

in reply to:  2 comment:3 by Mihai Bivol, 12 years ago

I have attached a patch in which I fix the issues stated before.

Category is now handled explicitly in opening_tag.

by Mihai Bivol, 12 years ago

Attachment: atomparse.2.diff added

comment:4 by Olly Betts, 12 years ago

Milestone: 1.2.10
Status: newassigned

OK, I've applied your patch in r16494, along with a few tweaks, and added a ChangeLog entry in r16495. Thanks for your contribution to Xapian!

I'll attach a patch with just the extra stuff, but here's a quick breakdown:

  • Update copyright headers.
  • Fix indentation to 4 space indent, tab-filled with 8 space wide tabs.
  • Refactor slightly to avoid the need for text_copy.
  • Parse type="html" as utf-8 rather than iso-8859-1 - the RFC doesn't seem to discuss this, but the XML will usually be utf-8 (we should probably use the same charset as the Atom file uses now I think about it, I'll look at that...)
  • Parse the HTML as if the charset came from a meta tag, to avoid it throwing an exception if there's a different character set specified in a meta tag in it (perhaps we should honour such an override, but we definitely don't want to die with an uncaught exception which is what happened before).
  • keywords = keywords + ' ' + new_keyword; will create a couple of unnecessary temporary string objects with most compilers - using += or append avoids that.
  • If type isn't specified, we should assume type=text rather than using whatever the last specified type we saw on an earlier element was.
  • Renamed is_escaped to is_ignored
  • Remove unneeded #include <iostream>
  • We have a preferred include order: <config.h> first, then the header corresponding to the source file, then other headers from our code in alphabetical order, then standard/system headers.
  • Updated documentation.
  • Added atomparsetest with some automated tests.

Marking to backport for 1.2.10.

If you have any good ideas about the HTML charset handling, let me know.

by Olly Betts, 12 years ago

Attachment: atomparse.additional.patch added

additional changes applied

comment:5 by Olly Betts, 12 years ago

OK, I changed it to use the XML's charset in r16496.

comment:6 by Olly Betts, 12 years ago

Resolution: fixed
Status: assignedclosed

CDATA change backported for 1.2.10 in r16519, and the other 3 changes in r16520.

Note: See TracTickets for help on using tickets.