XMLParser: possible bug in pos() method

Posted: Sat Jul 24, 2021 6:58 pm
by mrupp
Hi there
I'm using XMLParser to parse an XML document, and my goal is to get the full XML of each listed item. I had quite some trouble with that, but in the end I could pin it down to a possible bug in the pos() method, which seems to have trouble with special characters like ä, ö, ü, é, à, etc.

Here's a strongly simplified example that shows the bug:

Code:

@REQUIRE "RapaGUI", {Link = True}
@REQUIRE "xmlparser", {Link = True}
@APPTITLE "XmlParser-Test"

Global xml1$ = [[<root>
<item id="Q:01">
	<dc:title>A Prayer For England</dc:title>
	<r:narrator>Sinead O'Connor</r:narrator>
</item>
<item id="Q:02">
	<dc:title>Remember</dc:title>
	<r:narrator>aouaou</r:narrator>
</item>
<item id="Q:03">
	<dc:title>Now Is The Time</dc:title>
	<r:narrator>eaeeae</r:narrator>
</item>
</root>]]

Global xml2$ = [[<root>
<item id="Q:01">
	<dc:title>A Prayer For England</dc:title>
	<r:narrator>Sinéad O'Connor</r:narrator>
</item>
<item id="Q:02">
	<dc:title>Remember</dc:title>
	<r:narrator>äöüäöü</r:narrator>
</item>
<item id="Q:03">
	<dc:title>Now Is The Time</dc:title>
	<r:narrator>éàèéàè</r:narrator>
</item>
</root>]]

Function p_EventFunc(msg)
	Switch(msg.ID)
	Case "btnStart1":
		moai.DoMethod("ctrlLog", "clear")
		p_ParseItemList(xml1$)
	Case "btnStart2":
		moai.DoMethod("ctrlLog", "clear")
		p_ParseItemList(xml2$)
	EndSwitch
EndFunction

Function p_Log(text$)
	moai.DoMethod("ctrlLog", "insert", text$ .. "\n", "bottom")
EndFunction

Function p_ParseItemList(xml$)
	p_Log("----------------------------------------\nXML to parse:\n----------------------------------------")
	p_Log(xml$)
	p_Log("\n----------------------------------------\nStart parsing:\n----------------------------------------")

	Local tracks = { }, currentIndex, line, column, currentPos, endPos, currentElement$ = "", currentAttributes$ = ""
	callbacks = {
		StartElement = Function (parser, elementName$, attributes)
			currentElement$ = elementName$
			currentAttributes = attributes
			Switch elementName$
			Case "item":
				currentIndex = ListItems(tracks)
				line, column, currentPos = parser:pos()
				currentPos = currentPos - 1 ; currentPos is on char 'i', subtract 1 to include the opening <
				tracks[currentIndex] = { id = attributes.id }
			EndSwitch
		EndFunction,
		EndElement = Function (parser, elementName$)
			Switch elementName$
			Case "item":
				line, column, endPos = parser:pos()
				endPos = endPos + 6 ; endPos is on char '<', add 6 to include "/item>"
				tracks[currentIndex].item = MidStr(xml$, currentPos, endPos - currentPos)
				p_Log("Start: " .. currentPos .. ", Length: " .. (endPos - currentPos))
				p_Log(tracks[currentIndex].item .. "\n")
			EndSwitch
		EndFunction
	}

	Local p = xmlparser.new(callbacks)
	p:parse(xml$)
	p:close()
	callbacks = Nil

	Return(tracks)
EndFunction

InstallEventHandler({RapaGUI = p_EventFunc})

moai.CreateApp([[<?xml version="1.0" encoding="iso-8859-1"?>
<application id="app">
	<window id="mainWindow" width="400" height="800" title="XmlParser-Test">
		<vgroup>
			<button id="btnStart1">Start parsing XML 1</button>
			<button id="btnStart2">Start parsing XML 2</button>
			<texteditor id="ctrlLog" noWrap="true" />
		</vgroup>
	</window>
</application>]])

Repeat
	WaitEvent
Forever

The first example doesn't contain any special characters and runs just fine, but the second one, which does contain some, does not.
[Two screenshots of the log output, not preserved]

It seems that every special character throws the internal position counter off by one. This adds up, and as my real XML is quite large, by the end I'm only getting rubbish...

Re: XMLParser: possible bug in pos() method

Posted: Sat Jul 24, 2021 8:04 pm
by SamuraiCrow
Doesn't there need to be a header that indicates that the encoding is UTF-8 instead of ASCII?

Re: XMLParser: possible bug in pos() method

Posted: Sun Jul 25, 2021 12:39 am
by mrupp
SamuraiCrow wrote:
Sat Jul 24, 2021 8:04 pm
Doesn't there need to be a header that indicates that the encoding is UTF-8 instead of ASCII?
Good idea. Unfortunately, it didn't help. I added

Code:

<?xml version="1.0" encoding="utf-8"?>
to the beginning of the XMLs in my example, but with the same result as before.

Re: XMLParser: possible bug in pos() method

Posted: Wed Jul 28, 2021 10:08 pm
by airsoftsoftwair
That's not a bug. All position values in the xmlparser plugin are in bytes, not in characters. Luaexpat behaves the same.
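
To illustrate why byte positions and character positions drift apart, here is a small sketch in Python (not Hollywood, and not part of the xmlparser plugin). It assumes the XML is UTF-8 encoded, where each of the accented characters in the example takes two bytes; the helper name char_and_byte_offset is made up for this illustration.

```python
def char_and_byte_offset(text, needle):
    """Return (character offset, UTF-8 byte offset) of needle in text."""
    char_offset = text.index(needle)
    # Assumption: the document is stored as UTF-8, as in the thread.
    byte_offset = text.encode("utf-8").index(needle.encode("utf-8"))
    return char_offset, byte_offset


plain = "<n>aouaou</n><item>"
accented = "<n>äöüäöü</n><item>"

print(char_and_byte_offset(plain, "<item>"))     # (13, 13)
print(char_and_byte_offset(accented, "<item>"))  # (13, 19)
```

Each of the six umlauts adds one extra byte, so the byte offset of the following tag is six higher than its character offset. That matches the "off by one per special character" drift described in the first post.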

Re: XMLParser: possible bug in pos() method

Posted: Thu Jul 29, 2021 10:01 am
by mrupp
airsoftsoftwair wrote:
Wed Jul 28, 2021 10:08 pm
That's not a bug. All position values in the xmlparser plugin are in bytes, not in characters. Luaexpat behaves the same.
Hmm, too bad (and not very useful that Luaexpat implemented it that way, imho, what would somebody need the byte count for???)... any ideas on how I could work around this issue? I guess I would need something like MidStr() that uses the byte count instead of character count as values, right?

Re: XMLParser: possible bug in pos() method

Posted: Thu Jul 29, 2021 11:26 am
by mrupp
I guess I would need something like MidStr() that uses the byte count instead of character count as values, right?
GOT IT: I changed the MidStr() call in the EndElement callback from

Code:

tracks[currentIndex].item = MidStr(xml$, currentPos, endPos - currentPos)
to

Code:

tracks[currentIndex].item = MidStr(xml$, currentPos, endPos - currentPos, #ENCODING_RAW)
and now it WORKS! :D
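
For comparison, the same idea (slice the raw bytes using the parser's byte positions, then decode) can be sketched in Python with its standard xml.parsers.expat module. This is only an analogue under stated assumptions, not the Hollywood code: the function name extract_items and the sample XML are invented for the sketch, the input is assumed to be UTF-8, and Python's CurrentByteIndex is 0-based and already points at the "<", so the "- 1" adjustment from the Hollywood version is not needed here.

```python
import xml.parsers.expat


def extract_items(xml_text):
    """Return the raw XML of every <item> element, sliced by byte offsets."""
    data = xml_text.encode("utf-8")  # assumption: UTF-8 input
    items = []
    start = None
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        nonlocal start
        if name == "item":
            # Byte index of the '<' that opens the start tag.
            start = parser.CurrentByteIndex

    def on_end(name):
        if name == "item":
            # Byte index of the '<' of the end tag, plus the tag itself.
            end = parser.CurrentByteIndex + len(b"</item>")
            # Slice bytes, not characters, then decode the slice.
            items.append(data[start:end].decode("utf-8"))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return items


sample = '<root><item id="1"><n>Sinéad</n></item><item id="2"><n>äöü</n></item></root>'
for item in extract_items(sample):
    print(item)
```

Slicing the encoded bytes plays the same role as passing #ENCODING_RAW to MidStr(): both sides of the calculation then count in bytes, so accented characters no longer shift the result.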