Opened 13 years ago
Closed 13 years ago
#595 closed enhancement (fixed)
Allow omega to index Atom feed MIME type
Reported by: | Mihai Bivol | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.10 |
Component: | Omega | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
Omega doesn't currently index .atom files. I have attached a patch with a parser for atom files based on the currently existing HTML parser.
The top level title and author and are added to the document title and author. Inside entry tags, title is added to keywords.
The category tag is passed to keywords. The subtitle and summary tags are passed to dump.
Any other tags are ignored.
Attachments (3)
Change History (9)
by , 13 years ago
Attachment: | atomparse.diff added |
---|
comment:1 by , 13 years ago
Version: | → SVN trunk |
---|
I tried this out on my blog atom feeds, such as:
http://survex.com/~olly/blog/xapian/index.atom
The patch is definitely along the right lines, but I spotted some issues:
- The "author" field includes the feed uri (from the
<uri>
tag inside<author>
) which doesn't seem useful.
- The handling of
<category>
tags seems wrong - the code tries to pull out any content, but it looks like the category is in a term attribute, e.g.<category term="foo" />
- It seems to ignore
<content>
which I think probably should be treated as body text.
- It looks like
type=html
on various tags needs special handling.
- Also, it doesn't pick up any of the subtitle text in my blog, because our HTML parser doesn't handle CDATA, and just throws away the <...> it sees (this is a bug in our HTML parser really, rather than a bug in this patch):
<subtitle type="html"><![CDATA[ A blog lacking a good description. ]]></subtitle>
It'd be good to update the documentation to reflect this newly supported type too.
follow-up: 3 comment:2 by , 13 years ago
OK, I added support for CDATA to the HTML parser in r16490.
comment:3 by , 13 years ago
I have attached a patch in which I fix the issues stated before.
Category is now handled explicitly in opening_tag.
by , 13 years ago
Attachment: | atomparse.2.diff added |
---|
comment:4 by , 13 years ago
Milestone: | → 1.2.10 |
---|---|
Status: | new → assigned |
OK, I've applied your patch in r16494, along with a few tweaks, and added a ChangeLog entry in r16495. Thanks for your contribution to Xapian!
I'll attach a patch with just the extra stuff, but here's a quick breakdown:
- Update copyright headers.
- Fix indentation to 4 space indent, tab-filled with 8 space wide tabs.
- Refactor slightly to avoid the need for text_copy.
- Parse type="html" as utf-8 rather than iso-8859-1 - the RFC doesn't seem to discuss this, but the XML will usually be utf-8 (we should probably use the same charset as the Atom file uses now I think about it, I'll look at that...)
- Parse the HTML as if the charset came from a meta tag, to avoid it throwing an exception if there's a different character set specified in a meta tag in it (perhaps we should honour such an override, but we definitely don't want to die with an uncaught exception which is what happened before).
keywords = keywords + ' ' + new_keyword;
will create a couple of unnecessary temporary string objects with most compilers - using+=
orappend
avoids that.- If
type
isn't specified, we should assumetype=text
rather than using whatever the last specified type we saw on an earlier element was. - Renamed
is_escaped
tois_ignored
- Remove unneeded
#include <iostream>
- We have a preferred include order:
<config.h>
first, then the header corresponding to the source file, then other headers from our code in alphabetical order, then standard/system headers. - Updated documentation.
- Added atomparsetest with some automated tests.
Marking to backport for 1.2.10.
If you have any good ideas about the HTML charset handling, let me know.
comment:6 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Atom parser and allow Omega to index Atom files.