Specify indexing/length units #83

dmsnell · 2020-01-07T02:08:22Z

This is a great library! Unfortunately, it has been ambiguous about what input it accepts and what it outputs. That is, while we know that it's "character based," we don't have a definition of "character." The Lua library even makes it clear that, since Lua is unaware of Unicode, it will treat content "as a series of bytes, not a series of characters."

This ambiguity has caused numerous problems for folks wanting to interchange delta strings, and it gets us into tricky situations when dealing with emoji and other characters that are encoded as surrogate pairs in UTF-16.

Consider this example:

A: 🅰🅰
B: 🅰a🅰

We can all agree on what happened: we inserted an `a` between the two existing 🅰 characters.

Some libraries produce this delta: `=1\t+a\t=1`

Python3
Python2 when compiled in wide mode

Most libraries produce this delta: `=2\t+a\t=2`

Python2 when compiled in narrow mode
JavaScript
Objective-C
Java

I didn't check the others. This seems like enough to highlight the disparity in indexing and length calculations.
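To make the disparity concrete, here is a minimal Python 3 sketch that measures the same text in each candidate unit. The helper name `lengths` is made up for illustration, and it assumes that "bytes" means UTF-8 bytes, which this issue does not state explicitly.

```python
def lengths(text):
    """Measure text in the three candidate units.

    Hypothetical helper for illustration; "bytes" is assumed to mean UTF-8.
    """
    return {
        # Python 3 strings are indexed by Unicode code point.
        "code_points": len(text),
        # Each UTF-16 code unit occupies two bytes in the encoded form.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


NEGATIVE_SQUARED_A = "\U0001F170"  # the 🅰 character from the example

print(lengths(NEGATIVE_SQUARED_A))
print(lengths("a"))
```

For 🅰 the three measurements are 1, 2, and 4, which is why the equality groups around the inserted `a` differ between libraries; for `a` all three are 1.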

I propose a new non-breaking change to indicate what the index and length values are measured in.

In my own work in #80 I discovered that clients have no trouble decoding a blank insertion group in fromDelta().

Therefore I propose that we send blank insertion groups at the front of a delta to indicate what the indexing and length values correspond to.

There are only three realistic measurement units:

Unicode code points (probably what would have been most ideal to use from the start)
UTF-16 code units (because most platforms and languages use this internally)
bytes (because that's the most agnostic way of measuring this)

In addition we should point out that the legacy behavior is to not report measurement units.

In my proposal we'd put a number of empty insertion groups at the front of a delta to indicate which of those measurement units we want, in the order above: one group to indicate Unicode code points (since Unicode is the nominal way to think about text here); two groups to indicate UTF-16 code units (since each code unit is two bytes); three groups to indicate bytes (because I don't know what to do about Lua other than to make it obvious); and no groups to indicate an unreported measurement (identical to all existing deltas).

| Measurement units | Delta |
| --- | --- |
| Unicode code points | `+\t=1\t+a\t=1` |
| UTF-16 code units | `+\t+\t=2\t+a\t=2` |
| Bytes | `+\t+\t+\t=4\t+a\t=4` |
| Unspecified | one of the above without the prefix |
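The prefix scheme could be read and written roughly as follows. This is only a sketch of the proposal, not code from any of the libraries: the function names and the unit labels are hypothetical, and it assumes that a legacy delta never begins with an empty insertion group of its own.

```python
# Number of leading empty "+" groups mapped to the unit it would denote
# under this proposal. The labels are illustrative.
UNIT_BY_PREFIX_COUNT = {
    0: "unspecified",
    1: "code_points",
    2: "utf16_units",
    3: "bytes",
}
PREFIX_COUNT_BY_UNIT = {unit: n for n, unit in UNIT_BY_PREFIX_COUNT.items()}


def split_unit_prefix(delta):
    """Return (unit, delta_without_prefix) for a tab-separated delta."""
    tokens = delta.split("\t") if delta else []
    count = 0
    while count < len(tokens) and tokens[count] == "+":
        count += 1
    if count not in UNIT_BY_PREFIX_COUNT:
        raise ValueError("unrecognized unit prefix: %d empty groups" % count)
    return UNIT_BY_PREFIX_COUNT[count], "\t".join(tokens[count:])


def add_unit_prefix(delta, unit):
    """Prepend the empty insertion groups that denote the given unit."""
    prefix = ["+"] * PREFIX_COUNT_BY_UNIT[unit]
    return "\t".join(prefix + ([delta] if delta else []))


print(split_unit_prefix("+\t+\t=2\t+a\t=2"))
print(repr(add_unit_prefix("=1\t+a\t=1", "code_points")))
```

A decoder that does not know about the prefix would simply see zero-length insertions, which is what makes the change non-breaking in the cases described above.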

Note that these diffs should (might?) work in all existing libraries to produce the same result as they would without the leading + groups. However, this gives us a chance to update fromDelta() to support the denoted measurement units and then we can slowly migrate the client libraries to support returning their deltas in a requested unit.
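As a rough idea of what unit-aware decoding might look like, the sketch below applies a delta to a source text while interpreting the counts in a given unit. It is a simplified stand-in, not the library's fromDelta(): `apply_delta` and the unit labels are hypothetical, "bytes" is again assumed to mean UTF-8, and the unit is passed in explicitly rather than read from the prefix.

```python
from urllib.parse import unquote

# Unit label -> (encoding to slice in, bytes per unit). Illustrative only.
_ENCODING_BY_UNIT = {
    "code_points": ("utf-32-le", 4),
    "utf16_units": ("utf-16-le", 2),
    "bytes": ("utf-8", 1),
}


def apply_delta(text1, delta, unit):
    """Rebuild the target text from text1 and a delta measured in `unit`."""
    encoding, width = _ENCODING_BY_UNIT[unit]
    source = text1.encode(encoding)
    pos = 0
    out = []
    for token in delta.split("\t"):
        if not token:
            continue
        op, param = token[0], token[1:]
        if op == "+":
            # Insertions carry percent-encoded text; an empty one adds nothing.
            out.append(unquote(param).encode(encoding))
        elif op in ("=", "-"):
            n = int(param) * width
            if op == "=":
                out.append(source[pos:pos + n])
            pos += n
        else:
            raise ValueError("unknown delta operation: %r" % op)
    if pos != len(source):
        raise ValueError("delta length does not match source length")
    return b"".join(out).decode(encoding)


source = "\U0001F170\U0001F170"
print(apply_delta(source, "=1\t+a\t=1", "code_points"))
print(apply_delta(source, "=2\t+a\t=2", "utf16_units"))
print(apply_delta(source, "=4\t+a\t=4", "bytes"))
```

All three calls should yield the same target text, and applying a delta under the wrong unit fails the final length check instead of silently corrupting the output.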

google / diff-match-patch
