* Subclasses must define (from {@link PySequence}): *
* Many of the methods implemented here are inherited or thinly wrapped by {@link PyByteArray}, * which offers them as Java API, or exposes them as Python methods. These prototype Python methods * mostly accept a {@link PyObject} as argument, where you might have expected a {@code byte[]} or * {@code BaseBytes}, in order to accommodate the full range of types accepted by the Python * equivalent: usually, any {@code PyObject} that implements {@link BufferProtocol}, providing a * one-dimensional array of bytes, is an acceptable argument. In the documentation, the reader will * often see the terms "byte array" or "object viewable as bytes" instead of {@code BaseBytes} when * this broader scope is intended. *
* Where the methods return a {@code BaseBytes}, this is will normally be an instance of the class
* of the object on which the method was actually called. For example {@link #capitalize()}, defined
* in {@code BaseBytes} to return a BaseBytes, actually returns a {@link PyByteArray} when applied
* to a {@code bytearray}. Or it may be that the method returns a {@code PyList} of instances of the
* target type, for example {@link #rpartition(PyObject)}. This is achieved by the sub-class
* defining {@link #getslice(int, int, int)} and {@link #getResult(Builder)} to return instances of
* its own type. See the documentation of the particular methods for more information.
*/
@Untraversable
public abstract class BaseBytes extends PySequence implements List
* 1 2 3
* 0123456789012345678901234567890
* Text: a man, a map, a panama canal
* Pattern: panama
*
*
* This puts the 'p' of 'map' against the last byte 'a' of the pattern. Rather than testing the
* pattern, we will look up 'p' in the skip table. There is a 'p' just 5 steps from the end of
* the pattern, so we will move the pattern 5 places to the right before trying to match it.
* This allows us to move in large strides through the text.
*/
protected static class Finder {
/**
* Construct a Finder object that may be used (repeatedly) to find matches with the pattern
* in text (arrays of bytes).
*
* @param pattern A vew that presents the pattern as an array of bytes
*/
public Finder(PyBuffer pattern) {
this.pattern = pattern;
}
/**
* Mask defining how many of the bits of each byte are used when looking up the skip, used
* like: {@code skip = skipTable[MASK & currentByte]}.
*/
private static final byte MASK = 0x1f;
/**
* Table for looking up the skip, used like: {@code skip = skipTable[MASK & currentByte]}.
*/
protected int[] skipTable = null;
/**
* This method creates a compressed table of bad-character skips from the pattern. The entry
* for a given byte value tells us how far it is from the end of the pattern, being 0 for
* the actual last byte, or is equal to the length of the pattern if the byte does not occur
* in the pattern. The table is compressed in that only the least-significant bits of the
* byte index are used. In the case where 5 bits are used, the table is only 32 elements
* long, rather than (as it might be) 256 bytes, the number of distinct byte values.
*/
protected int[] calculateSkipTable() {
int[] skipTable = new int[MASK + 1];
int m = pattern.getLen();
// Default skip is the pattern length: for bytes not in the pattern.
Arrays.fill(skipTable, m);
// For each byte in the pattern, make an entry for how far it is from the end.
// The last occurrence of the byte value prevails.
for (int i = 0; i < m; i++) {
skipTable[MASK & pattern.byteAt(i)] = m - i - 1;
}
return skipTable;
}
/**
* Set the text to be searched in successive calls to {@code nextIndex()}, where the text is
* the entire array {@code text[]}.
*
* @param text to search
*/
public void setText(byte[] text) {
setText(text, 0, text.length);
}
/**
* Set the text to be searched in successive calls to {@code nextIndex()}, where the text is
* the entire byte array {@code text}.
*
* @param text to search
*/
public void setText(BaseBytes text) {
setText(text.storage, text.offset, text.size);
}
/**
* Set the text to be searched in successive calls to {@code nextIndex()}, where the text is
* effectively only the bytes {@code text[start]} to {@code text[start+size-1]} inclusive.
*
* @param text to search
* @param start first position to consider
* @param size number of bytes within which to match
*/
public void setText(byte[] text, int start, int size) {
this.text = text;
this.left = start;
right = start + size - pattern.getLen() + 1; // Last pattern position + 1
/*
* We defer computing the table from construction to this point mostly because
* calculateSkipTable() may be overridden, and we want to use the right one.
*/
if (pattern.getLen() > 1 && skipTable == null) {
skipTable = calculateSkipTable();
}
}
protected final PyBuffer pattern;
protected byte[] text = emptyStorage; // in case we forget to setText()
protected int left = 0; // Leftmost pattern position to use
protected int right = 0; // Rightmost pattern position + 1
/**
* Return the index in the text array where the preceding pattern match ends (one beyond the
* last character matched), which may also be one beyond the effective end ofthe text.
* Between a call to setText() and the first call to {@code nextIndex()} return the start
* position.
* * The following idiom may be used: * *
* f.setText(text);
* int p = f.nextIndex();
* int q = f.currIndex();
* // The range text[p:q] is the matched segment.
*
*
* @return index beyond end of last match found, i.e. where search will resume
*/
public int currIndex() {
return left;
}
/**
* Find the next index in the text array where the pattern starts. Successive calls to
* {@code nextIndex()} return the successive (non-overlapping) occurrences of the pattern in
* the text.
*
* @return matching index or -1 if no (further) occurrences found
*/
public int nextIndex() {
int m = pattern.getLen();
if (skipTable != null) { // ... which it will not be if m>1 and setText() was called
/*
* Boyer-Moore-Horspool Algorithm using a Bloom array. Based on CPython stringlib,
* but without avoiding a proper bad character skip array.
*/
for (int i = left; i < right; /* i incremented in loop */) {
/*
* Unusually, start by looking up the skip. If text[i+m-1] matches, skip==0,
* although it will also be zero if only the least-significant bits match.
*/
int skip = skipTable[MASK & text[i + (m - 1)]];
if (skip == 0) {
// Possible match, but we only used the least-significant bits: check all
int j, k = i;
for (j = 0; j < m; j++) { // k = i + j
if (text[k++] != pattern.byteAt(j)) {
break;
}
}
// If we tested all m bytes, that's a match.
if (j == m) {
left = k; // Start at text[i+m] next time we're called
return i;
}
// It wasn't a match: advance by one
i += 1;
} else {
/*
* The last byte of the pattern does not match the corresponding text byte.
* Skip tells us how far back down the pattern is a potential match, so how
* far it is safe to advance before we do another last-byte test.
*/
i += skip;
}
}
} else if (m == 1) {
// Special case of single byte search
byte b = pattern.byteAt(0);
for (int i = left; i < right; i++) {
if (text[i] == b) {
left = i + 1; // Start at text[i+1] next time we're called
return i;
}
}
} else {
// Special case of search for empty (m==0) byte string
int i = left;
if (i <= right) {
// It is an honorary match - even when left==right
left = i + 1;
return i;
}
}
// All sections fall out here if they do not find a match (even m==0)
return -1;
}
/**
* Count the non-overlapping occurrences of the pattern in the text.
*
* @param text to search
* @return number of occurrences
*/
public int count(byte[] text) {
return count(text, 0, text.length, Integer.MAX_VALUE);
}
/**
* Count the non-overlapping occurrences of the pattern in the text, where the text is
* effectively only the bytes {@code text[start]} to {@code text[start+size-1]} inclusive.
*
* @param text to search
* @param start first position to consider
* @param size number of bytes within which to match
* @return number of occurrences
*/
public int count(byte[] text, int start, int size) {
return count(text, start, size, Integer.MAX_VALUE);
}
/**
* Count the non-overlapping occurrences of the pattern in the text, where the text is
* effectively only the bytes {@code text[start]} to {@code text[start+size-1]}.
*
* @param text to search
* @param start first position to consider
* @param size number of bytes within which to match
* @param maxcount limit to number of occurrences to find
* @return number of occurrences
*/
public int count(byte[] text, int start, int size, int maxcount) {
setText(text, start, size);
int count = 0;
while (count < maxcount && nextIndex() >= 0) {
count++;
}
return count;
}
}
/**
* This class is the complement of {@link Finder} and implements the Boyer-Moore-Horspool
* Algorithm adapted for right-to-left search for a pattern in byte arrays.
*/
protected static class ReverseFinder extends Finder {
/**
* Construct a ReverseFinder object that may be used (repeatedly) to find matches with the
* pattern in text (arrays of bytes).
*
* @param pattern A vew that presents the pattern as an array of bytes
*/
public ReverseFinder(PyBuffer pattern) {
super(pattern);
}
/**
* Mask defining how many of the bits of each byte are used when looking up the skip, used
* like: {@code skip = skipTable[MASK & currentByte]}.
* * Note that the way this is written at the moment, if {@code MASK} is different from * {@code super.MASK} {@code calculateSkipTable()} and {@code nextIndex()} must both be * overridden consistently to use the local definition. */ private static final byte MASK = 0x1f; /** * This method creates a compressed table of bad-character skips from the pattern for * reverse-searching. The entry for a given byte value tells us how far it is from the start * of the pattern, being 0 for the actual first byte, or is equal to the length of the * pattern if the byte does not occur in the pattern. The table is compressed in that only * the least-significant bits of the byte index are used. In the case where 5 bits are used, * the table is only 32 elements long, rather than (as it might be) 256 bytes, the number of * distinct byte values. */ @Override protected int[] calculateSkipTable() { int[] skipTable = new int[MASK + 1]; int m = pattern.getLen(); // Default skip is the pattern length: for bytes not in the pattern. Arrays.fill(skipTable, m); // For each byte in the pattern, make an entry for how far it is from the start. // The last occurrence of the byte value prevails. for (int i = m; --i >= 0;) { skipTable[MASK & pattern.byteAt(i)] = i; } return skipTable; } /** * * @return the new effective end of the text */ @Override public int currIndex() { return right + pattern.getLen() - 1; } /** * Find the next index in the text array where the pattern starts, but working backwards. * Successive calls to {@code nextIndex()} return the successive (non-overlapping) * occurrences of the pattern in the text. * * @return matching index or -1 if no (further) occurrences found */ @Override public int nextIndex() { int m = pattern.getLen(); if (skipTable != null) { // ... which it will not be if m>1 and setText() was called /* * Boyer-Moore-Horspool Algorithm using a Bloom array. Based on CPython stringlib, * but without avoiding a proper bad character skip array. */ for (int i = right - 1; i >= left; /* i decremented in loop */) { /* * Unusually, start by looking up the skip. If text[i] matches, skip==0, * although it will also be zero if only the least-significant bits match. */ int skip = skipTable[MASK & text[i]]; if (skip == 0) { // Possible match, but we only used the least-significant bits: check all int j, k = i; for (j = 0; j < m; j++) { // k = i + j if (text[k++] != pattern.byteAt(j)) { break; } } // If we tested all m bytes, that's a match. if (j == m) { right = i - m + 1; // Start at text[i-m] next time we're called return i; } // It wasn't a match: move left by one i -= 1; } else { /* * The first byte of the pattern does not match the corresponding text byte. * Skip tells us how far up the pattern is a potential match, so how far * left it is safe to move before we do another first-byte test. */ i -= skip; } } } else if (m == 1) { // Special case of single byte search byte b = pattern.byteAt(0); for (int i = right; --i >= left;) { if (text[i] == b) { right = i; // Start at text[i-1] next time we're called return i; } } } else { // Special case of search for empty (m==0) byte string int i = right; if (--i >= left) { // It is an honorary match - even when right==left right = i; return i; } } // All sections fall out here if they do not find a match (even m==0) return -1; } } /** * Class for quickly determining whether a given byte is a member of a defined set. this class * provides an efficient mechanism when a lot of bytes must be tested against the same set. */ protected static class ByteSet { protected final long[] map = new long[4]; // 256 bits /** * Construct a set from a byte oriented view. * * @param bytes to be in the set. */ public ByteSet(PyBuffer bytes) { int n = bytes.getLen(); for (int i = 0; i < n; i++) { int c = bytes.intAt(i); long mask = 1L << c; // Only uses low 6 bits of c (JLS) int word = c >> 6; map[word] |= mask; } } /** * Test to see if the byte is in the set. * * @param b value of the byte * @return true iff b is in the set */ public boolean contains(byte b) { int word = (b & 0xff) >> 6; long mask = 1L << b; // Only uses low 6 bits of b (JLS) return (map[word] & mask) != 0; } /** * Test to see if the byte (expressed as an integer) is in the set. * * @param b integer value of the byte * @return true iff b is in the set * @throws ArrayIndexOutOfBoundsException if b%gt;255 or b<0 */ public boolean contains(int b) { int word = b >> 6; long mask = 1L << b; // Only uses low 6 bits of b (JLS) return (map[word] & mask) != 0; } } /** * Convenience routine producing a ValueError for "empty separator" if the PyBuffer is of an * object with zero length, and returning the length otherwise. * * @param separator view to test * @return the length of the separator * @throws PyException if the PyBuffer is zero length */ protected final static int checkForEmptySeparator(PyBuffer separator) throws PyException { int n = separator.getLen(); if (n == 0) { throw Py.ValueError("empty separator"); } return n; } /** * Return the index [0..size-1] of the leftmost byte not matching any in {@code byteSet}, or * {@code size} if they are all strippable. * * @param byteSet set of byte values to skip over * @return index of first unstrippable byte */ protected int lstripIndex(ByteSet byteSet) { int limit = offset + size; // Run up the storage checking against byteSet (or until we hit the end) for (int left = offset; left < limit; left++) { // Check against the byteSet to see if this is one to strip. if (!byteSet.contains(storage[left])) { // None of them matched: this is the leftmost non-strippable byte return left - offset; } } // We went through the whole array and they can all be stripped return size; } /** * Return the index [0..size-1] of the leftmost non-whitespace byte, or {@code size} if they are * all whitespace. * * @return index of first non-whitespace byte */ protected int lstripIndex() { int limit = offset + size; // Run up the storage until non-whitespace (or hit end) for (int left = offset; left < limit; left++) { if (!isspace(storage[left])) { return left - offset; } } // We went through the whole array and they are all whitespace return size; } /** * Return the index [0..size-1] such that all bytes from here to the right match one in * {@code byteSet}, that is, the index of the matching tail, or {@code size} if there is no * matching tail byte. * * @param byteSet set of byte values to strip * @return index of strippable tail */ protected int rstripIndex(ByteSet byteSet) { // Run down the storage checking the next byte against byteSet (or until we hit the start) for (int right = offset + size; right > offset; --right) { // Check against the byteSet to see if this is one to strip. if (!byteSet.contains(storage[right - 1])) { // None of them matched: this is the first strippable byte in the tail return right - offset; } } // We went through the whole array and they can all be stripped return 0; } /** * Return the index [0..size-1] such that all bytes from here to the right are whitespace, that * is, the index of the whitespace tail, or {@code size} if there is no whitespace tail. * * @return index of strippable tail */ protected int rstripIndex() { // Run down the storage until next is non-whitespace (or hit start) for (int right = offset + size; right > offset; --right) { if (!isspace(storage[right - 1])) { return right - offset; } } // We went through the whole array and they are all whitespace return size; } /** * Ready-to-expose implementation of Python {@code count( sub [, start [, end ]] )}. Return the * number of non-overlapping occurrences of {@code sub} in the range [start, end]. Optional * arguments {@code start} and {@code end} (which may be {@code null} or {@code Py.None}) are * interpreted as in slice notation. * * @param sub bytes to find * @param ostart of slice to search * @param oend of slice to search * @return count of occurrences of sub within this byte array */ final int basebytes_count(PyObject sub, PyObject ostart, PyObject oend) { try (PyBuffer vsub = getViewOrError(sub)) { Finder finder = new Finder(vsub); // Convert [ostart:oend] to integers int[] index = indicesEx(ostart, oend); // [ start, end, 1, end-start ] // Make this slice the thing we count within. return finder.count(storage, offset + index[0], index[3]); } } /** * Ready-to-expose implementation of Python {@code find( sub [, start [, end ]] )}. Return the * lowest index in the byte array where byte sequence {@code sub} is found, such that * {@code sub} is contained in the slice {@code [start:end]}. Arguments {@code start} and * {@code end} (which may be {@code null} or {@code Py.None} ) are interpreted as in slice * notation. Return -1 if {@code sub} is not found. * * @param sub bytes to find * @param ostart of slice to search * @param oend of slice to search * @return index of start of occurrence of sub within this byte array */ final int basebytes_find(PyObject sub, PyObject ostart, PyObject oend) { try (PyBuffer vsub = getViewOrError(sub)) { Finder finder = new Finder(vsub); return find(finder, ostart, oend); } } /** * Almost ready-to-expose implementation of Python class method {@code fromhex(string)}. This * assigns a value to the passed byte array object from a string of two-digit hexadecimal * numbers. Spaces (but not whitespace in general) are acceptable around the numbers, not * within. Non-hexadecimal characters or un-paired hex digits raise a {@code ValueError}. * Example: * *
* bytearray.fromhex('B9 01EF') -> * bytearray(b'\xb9\x01\xef')."
*
*
* @param result to receive the decoded values
* @param hex specification of the bytes
* @throws PyException {@code ValueError} if non-hex characters, or isolated ones, are
* encountered
*/
static void basebytes_fromhex(BaseBytes result, String hex) throws PyException {
final int hexlen = hex.length();
result.newStorage(hexlen / 2); // Over-provides storage if hex has spaces
// We might produce a ValueError with this message.
String fmt = "non-hexadecimal number found in fromhex() arg at position %d";
// Output pointer in the result array
byte[] r = result.storage;
int p = result.offset;
/*
* When charAt(i) is a hex digit, we will always access hex.charAt(i+1), and catch the
* exception if that is beyond the end of the array.
*/
for (int i = 0; i < hexlen; /* i incremented in loop by 1 or 2 */) {
char c = hex.charAt(i++);
if (c != ' ') {
try {
// hexDigit throws IllegalArgumentException if non-hexadecimal character found
int value = hexDigit(c);
c = hex.charAt(i++); // Throw IndexOutOfBoundsException if no second digit
value = (value << 4) + hexDigit(c);
r[p++] = (byte) value;
} catch (IllegalArgumentException e) {
throw Py.ValueError(String.format(fmt, i - 1));
} catch (IndexOutOfBoundsException e) {
throw Py.ValueError(String.format(fmt, i - 2));
}
}
}
result.size = p - result.offset;
}
/**
* Translate one character to its hexadecimal value.
*
* @param c to translate
* @return value 0-15
* @throws IllegalArgumentException if c is not '0-'9', 'A'-'F' or 'a'-'f'.
*/
private static int hexDigit(char c) throws IllegalArgumentException {
int result = c - '0';
if (result >= 0) {
if (result < 10) { // digit
return result;
} else {
// If c is a letter, c & 0xDF is its uppercase.
// If c is not a letter c & 0xDF is still not a letter.
result = (c & 0xDF) - 'A';
if (result >= 0 && result < 6) { // A-F or a-f
return result + 10;
}
}
}
throw new IllegalArgumentException();
}
/**
* Almost ready-to-expose implementation of Python {@code join(iterable)}.
*
* @param iter iterable of objects capable of being regarded as byte arrays
* @return the byte array that is their join
*/
final synchronized PyByteArray basebytes_join(Iterable iter) {
List* The elements of the {@code PyTuple} returned by this method are instances of the same actual * type as {@code this}. * * @param sep the separator on which to partition this byte array * @return a tuple of (head, separator, tail) */ public PyTuple partition(PyObject sep) { return basebytes_partition(sep); } /** * Ready-to-expose implementation of Python {@code partition(sep)}. *
* The elements of the {@code PyTuple} returned by this method are instances of the same actual * type as {@code this}. * * @param sep the separator on which to partition this byte array * @return a tuple of (head, separator, tail) */ final synchronized PyTuple basebytes_partition(PyObject sep) { // View the separator as a byte array (or error if we can't) try (PyBuffer separator = getViewOrError(sep)) { // Create a Finder for the separator and set it on this byte array int n = checkForEmptySeparator(separator); Finder finder = new Finder(separator); finder.setText(this); // We only use it once, to find the first occurrence int p = finder.nextIndex() - offset; if (p >= 0) { // Found at p, so we'll be returning ([0:p], [p:p+n], [p+n:]) return partition(p, p + n); } else { // Not found: choose values leading to ([0:size], '', '') return partition(size, size); } } } /** * Construct return value for implementation of Python {@code partition(sep)} or * {@code rpartition(sep)}, returns {@code [0:p], [p:q], [q:]} *
* The elements of the {@code PyTuple} returned by this method are instances of the same actual * type as {@code this}. * * @param p start of separator * @param q start of tail * @return ([0:p], [p:q], [q:]) */ private PyTuple partition(int p, int q) { BaseBytes head = this.getslice(0, p); BaseBytes sep = this.getslice(p, q); BaseBytes tail = this.getslice(q, size); return new PyTuple(head, sep, tail); } /** * Ready-to-expose implementation of Python {@code rfind( sub [, start [, end ]] )}. Return the * highest index in the byte array where byte sequence {@code sub} is found, such that * {@code sub} is contained in the slice {@code [start:end]}. Arguments {@code start} and * {@code end} (which may be {@code null} or {@code Py.None}) are interpreted as in slice * notation. Return -1 if {@code sub} is not found. * * @param sub bytes to find * @param ostart of slice to search * @param oend of slice to search * @return index of start of occurrence of sub within this byte array */ final int basebytes_rfind(PyObject sub, PyObject ostart, PyObject oend) { try (PyBuffer vsub = getViewOrError(sub)) { Finder finder = new ReverseFinder(vsub); return find(finder, ostart, oend); } } /** * Common code for Python {@code find( sub [, start [, end ]] )} and * {@code rfind( sub [, start [, end ]] )}. Return the lowest or highest index in the byte array * where byte sequence used to construct {@code finder} is found. The particular type (plain * {@code Finder} or {@code ReverseFinder}) determines the direction. * * @param finder for the bytes to find, sometime forwards, sometime backwards * @param ostart of slice to search * @param oend of slice to search * @return index of start of occurrence of sub within this byte array */ private final int find(Finder finder, PyObject ostart, PyObject oend) { // Convert [ostart:oend] to integers int[] index = indicesEx(ostart, oend); // [ start, end, 1, end-start ] // Make this slice the thing we search. Note finder works with Java index in storage. finder.setText(storage, offset + index[0], index[3]); int result = finder.nextIndex(); // Compensate for the offset in returning a value return (result < 0) ? -1 : result - offset; } /** * An almost ready-to-expose implementation of Python {@code replace( old, new [, count ] )}, * returning a {@code PyByteArray} with all occurrences of sequence {@code oldB} replaced by * {@code newB}. If the optional argument {@code count} is given, only the first {@code count} * occurrences are replaced. * * @param oldB sequence to find * @param newB relacement sequence * @param maxcount maximum occurrences are replaced or < 0 for all * @return result of replacement as a new PyByteArray */ final synchronized PyByteArray basebytes_replace(PyObject oldB, PyObject newB, int maxcount) { // View the to and from as byte arrays (or error if we can't) try (PyBuffer to = getViewOrError(newB); PyBuffer from = getViewOrError(oldB)) { /* * The logic of the first section is copied exactly from CPython in order to get the * same behaviour. The "headline" description of replace is simple enough but the corner * cases can be surprising: */ // >>> bytearray(b'hello').replace(b'',b'-') // bytearray(b'-h-e-l-l-o-') // >>> bytearray(b'hello').replace(b'',b'-',3) // bytearray(b'-h-e-llo') // >>> bytearray(b'hello').replace(b'',b'-',1) // bytearray(b'-hello') // >>> bytearray().replace(b'',b'-') // bytearray(b'-') // >>> bytearray().replace(b'',b'-',1) # ? // bytearray(b'') if (maxcount < 0) { maxcount = Integer.MAX_VALUE; } else if (maxcount == 0 || size == 0) { // nothing to do; return the original bytes return new PyByteArray(this); } int from_len = from.getLen(); int to_len = to.getLen(); if (maxcount == 0 || (from_len == 0 && to_len == 0)) { // nothing to do; return the original bytes return new PyByteArray(this); } else if (from_len == 0) { // insert the 'to' bytes everywhere. // >>> "Python".replace("", ".") // '.P.y.t.h.o.n.' return replace_interleave(to, maxcount); } else if (size == 0) { // Special case for "".replace("", "A") == "A" return new PyByteArray(to); } else if (to_len == 0) { // Delete occurrences of the 'from' bytes return replace_delete_substring(from, maxcount); } else if (from_len == to_len) { // Result is same size as this byte array, whatever the number of replacements. return replace_substring_in_place(from, to, maxcount); } else { // Otherwise use the generic algorithm return replace_substring(from, to, maxcount); } } } /* * Algorithms for different cases of string replacement. CPython also has specialisations for * when 'from' or 'to' or both are single bytes. This may also be worth doing in Java when the * 'to' is a single byte. (The 'from' is turned into a Finder object which already makes a * special case of single bytes.) */ /** * Helper for {@link #basebytes_replace(PyObject, PyObject, int)} implementing the general case * of byte-string replacement when the new and old strings have different lengths. * * @param from byte-string to find and replace * @param to replacement byte-string * @param maxcount maximum number of replacements to make * @return the result as a new PyByteArray */ private PyByteArray replace_substring(PyBuffer from, PyBuffer to, int maxcount) { // size>=1, len(from)>=1, len(to)>=1, maxcount>=1 // Initialise a Finder for the 'from' pattern Finder finder = new Finder(from); int count = finder.count(storage, offset, size, maxcount); if (count == 0) { // no matches return new PyByteArray(this); } int from_len = from.getLen(); int to_len = to.getLen(); // Calculate length of result and check for too big long result_len = size + count * (to_len - from_len); byte[] r; // Build result here try { // Good to go. As we know the ultimate size, we can do all our allocation in one r = new byte[(int) result_len]; } catch (OutOfMemoryError e) { throw Py.OverflowError("replace bytes is too long"); } int p = offset; // Copy-from index in this.storage int rp = 0; // Copy-to index in r // Reset the Finder on the (active part of) this.storage finder.setText(storage, p, size); while (count-- > 0) { // First occurrence of 'from' bytes in storage int q = finder.nextIndex(); if (q < 0) { // Never happens because we've got count right break; } // Output the stretch up to the discovered occurrence of 'from' int length = q - p; if (length > 0) { System.arraycopy(storage, p, r, rp, length); rp += length; } // Skip over the occurrence of the 'from' bytes p = q + from_len; // Output a copy of 'to' to.copyTo(r, rp); rp += to_len; } // Copy the rest of the original string int length = size + offset - p; if (length > 0) { System.arraycopy(storage, p, r, rp, length); rp += length; } // Make r[] the storage of a new bytearray return new PyByteArray(r); } /** * Handle the interleaving case b'hello'.replace(b'', b'..') = b'..h..e..l..l..o..' At the call * site we are guaranteed: size>=1, to.getLen()>=1, maxcount>=1 * * @param to the replacement bytes as a byte-oriented view * @param maxcount upper limit on number of insertions */ private PyByteArray replace_interleave(PyBuffer to, int maxcount) { // Insert one at the beginning and one after every byte, or as many as allowed int count = size + 1; if (maxcount < count) { count = maxcount; } int to_len = to.getLen(); // Calculate length of result and check for too big long result_len = ((long) count) * to_len + size; byte[] r; // Build result here try { // Good to go. As we know the ultimate size, we can do all our allocation in one r = new byte[(int) result_len]; } catch (OutOfMemoryError e) { throw Py.OverflowError("replace bytes is too long"); } int p = offset; // Copy-from index in this.storage int rp = 0; // Copy-to index in r // Lay the first one down (guaranteed this will occur as count>=1) to.copyTo(r, rp); rp += to_len; // And the rest for (int i = 1; i < count; i++) { r[rp++] = storage[p++]; to.copyTo(r, rp); rp += to_len; } // Copy the rest of the original string int length = size + offset - p; if (length > 0) { System.arraycopy(storage, p, r, rp, length); rp += length; } // Make r[] the storage of a new bytearray return new PyByteArray(r); } /** * Helper for {@link #basebytes_replace(PyObject, PyObject, int)} implementing the special case * of byte-string replacement when the new string has zero length, i.e. deletion. * * @param from byte-string to find and delete * @param maxcount maximum number of deletions to make * @return the result as a new PyByteArray */ private PyByteArray replace_delete_substring(PyBuffer from, int maxcount) { // len(self)>=1, len(from)>=1, to="", maxcount>=1 // Initialise a Finder for the 'from' pattern Finder finder = new Finder(from); int count = finder.count(storage, offset, size, maxcount); if (count == 0) { // no matches return new PyByteArray(this); } int from_len = from.getLen(); long result_len = size - (count * from_len); assert (result_len >= 0); byte[] r; // Build result here try { // Good to go. As we know the ultimate size, we can do all our allocation in one r = new byte[(int) result_len]; } catch (OutOfMemoryError e) { throw Py.OverflowError("replace bytes is too long"); } int p = offset; // Copy-from index in this.storage int rp = 0; // Copy-to index in r // Reset the Finder on the (active part of) this.storage finder.setText(storage, offset, size); while (count-- > 0) { // First occurrence of 'from' bytes in storage int q = finder.nextIndex(); if (q < 0) { // Never happens because we've got count right break; } // Output the stretch up to the discovered occurrence of 'from' int length = q - p; if (length > 0) { System.arraycopy(storage, p, r, rp, length); rp += length; } // Skip over the occurrence of the 'from' bytes p = q + from_len; } // Copy the rest of the original string int length = size + offset - p; if (length > 0) { System.arraycopy(storage, p, r, rp, length); rp += length; } // Make r[] the storage of a new bytearray return new PyByteArray(r); } /** * Helper for {@link #basebytes_replace(PyObject, PyObject, int)} implementing the special case * of byte-string replacement when the new and old strings have the same length. The key * observation here is that the result has the same size as this byte array, and we know this * even without counting how many replacements we shall make. * * @param from byte-string to find and replace * @param to replacement byte-string * @param maxcount maximum number of replacements to make * @return the result as a new PyByteArray */ private PyByteArray replace_substring_in_place(PyBuffer from, PyBuffer to, int maxcount) { // len(self)>=1, len(from)==len(to)>=1, maxcount>=1 // Initialise a Finder for the 'from' pattern Finder finder = new Finder(from); int count = maxcount; // The result will be this.size byte[] r; // Build result here try { r = new byte[this.size]; } catch (OutOfMemoryError e) { throw Py.OverflowError("replace bytes is too long"); } System.arraycopy(storage, offset, r, 0, size); // Change everything in-place: easiest if we search the destination finder.setText(r); while (count-- > 0) { int q = finder.nextIndex(); // Note q is an index into result.storage if (q < 0) { // Normal exit because we discover actual count as we go along break; } // Overwrite with 'to' the stretch corresponding to the matched 'from' to.copyTo(r, q); } // Make r[] the storage of a new bytearray return new PyByteArray(r); } /** * Implementation of Python {@code rpartition(sep)}, returning a 3-tuple of byte arrays (of the * same type as {@code this}). * * Split the string at the rightmost occurrence of {@code sep}, and return a 3-tuple containing * the part before the separator, the separator itself, and the part after the separator. If the * separator is not found, return a 3-tuple containing two empty byte arrays, followed by the * byte array itself. *
* The elements of the {@code PyTuple} returned by this method are instances of the same actual * type as {@code this}. * * @param sep the separator on which to partition this byte array * @return a tuple of (head, separator, tail) */ public PyTuple rpartition(PyObject sep) { return basebytes_rpartition(sep); } /** * Ready-to-expose implementation of Python {@code rpartition(sep)}. *
* The elements of the {@code PyTuple} returned by this method are instances of the same actual * type as {@code this}. * * @param sep the separator on which to partition this byte array * @return a tuple of (head, separator, tail) */ final synchronized PyTuple basebytes_rpartition(PyObject sep) { // View the separator as a byte array (or error if we can't) try (PyBuffer separator = getViewOrError(sep)) { // Create a Finder for the separtor and set it on this byte array int n = checkForEmptySeparator(separator); Finder finder = new ReverseFinder(separator); finder.setText(this); // We only use it once, to find the first (from the right) occurrence int p = finder.nextIndex() - offset; if (p >= 0) { // Found at p, so we'll be returning ([0:p], [p:p+n], [p+n:]) return partition(p, p + n); } else { // Not found: choose values leading to ('', '', [0:size]) return partition(0, 0); } } } /** * Implementation of Python {@code rsplit()}, that returns a list of the words in the byte * array, using whitespace as the delimiter. See {@link #rsplit(PyObject, int)}. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @return PyList of byte arrays that result from the split */ public PyList rsplit() { return basebytes_rsplit_whitespace(-1); } /** * Implementation of Python {@code rsplit(sep)}, that returns a list of the words in the byte * array, using {@code sep} as the delimiter. See {@link #rsplit(PyObject, int)} for the * semantics of the separator. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @return PyList of byte arrays that result from the split */ public PyList rsplit(PyObject sep) { return basebytes_rsplit(sep, -1); } /** * Implementation of Python {@code rsplit(sep, maxsplit)}, that returns a list of the words in * the byte array, using {@code sep} as the delimiter. If {@code maxsplit} is given, at most * {@code maxsplit} splits are done (thus, the list will have at most {@code maxsplit+1} * elements). If {@code maxsplit} is not specified, then there is no limit on the number of * splits (all possible splits are made). *
* The semantics of {@code sep} and maxcount are identical to those of * {@code split(sep, maxsplit)} , except that splits are generated from the right (and pushed * onto the front of the result list). The result is only different from that of {@code split} * if {@code maxcount} limits the number of splits. For example, *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ public PyList rsplit(PyObject sep, int maxsplit) { return basebytes_rsplit(sep, maxsplit); } /** * Ready-to-expose implementation of Python {@code rsplit(sep, maxsplit)}, that returns a list * of the words in the byte array, using {@code sep} as the delimiter. Use the defines * whitespace semantics if {@code sep} is {@code null}. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ final PyList basebytes_rsplit(PyObject sep, int maxsplit) { if (sep == null || sep == Py.None) { return basebytes_rsplit_whitespace(maxsplit); } else { return basebytes_rsplit_explicit(sep, maxsplit); } } /** * Implementation of Python {@code rsplit(sep, maxsplit)}, that returns a list of the words in * the byte array, using {@code sep} (which is not {@code null}) as the delimiter. If * {@code maxsplit>=0}, at most {@code maxsplit} splits are done (thus, the list will have at * most {@code maxsplit+1} elements). If {@code maxsplit<0}, then there is no limit on the * number of splits (all possible splits are made). *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ final synchronized PyList basebytes_rsplit_explicit(PyObject sep, int maxsplit) { // The separator may be presented as anything viewable as bytes try (PyBuffer separator = getViewOrError(sep)) { int n = checkForEmptySeparator(separator); PyList result = new PyList(); // Use the Finder class to search in the storage of this byte array Finder finder = new ReverseFinder(separator); finder.setText(this); int q = offset + size; // q points to "honorary separator" int p; // At this point storage[q-1] is the last byte of the rightmost unsplit word, or // q=offset if there aren't any. While we have some splits left to do ... while (q > offset && maxsplit-- != 0) { // Delimit the word whose last byte is storage[q-1] int r = q; // Skip p backwards over the word and the separator q = finder.nextIndex(); if (q < 0) { p = offset; } else { p = q + n; } // storage[p] is the first byte of the word. BaseBytes word = getslice(p - offset, r - offset); result.add(0, word); } // Prepend the remaining unsplit text if any if (q >= offset) { BaseBytes word = getslice(0, q - offset); result.add(0, word); } return result; } } /** * Implementation of Python {@code rsplit(None, maxsplit)}, that returns a list of the words in * the byte array, using whitespace as the delimiter. If {@code maxsplit} is given, at most * {@code maxsplit} splits are done (thus, the list will have at most {@code maxsplit+1} * elements). If maxsplit is not specified, then there is no limit on the number of splits (all * possible splits are made). *
* Runs of consecutive whitespace are regarded as a single separator, and the result will * contain no empty strings at the start or end if the string has leading or trailing * whitespace. Consequently, splitting an empty string or a string consisting of just whitespace * with a {@code None} separator returns []. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this/self}. * * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ final synchronized PyList basebytes_rsplit_whitespace(int maxsplit) { PyList result = new PyList(); int p, q; // Indexes of unsplit text and whitespace // Scan backwards over trailing whitespace for (q = offset + size; q > offset; --q) { if (!isspace(storage[q - 1])) { break; } } // Note: bytearray().rsplit() = bytearray(b' ').rsplit() = [] // At this point storage[q-1] is the rightmost non-space byte, or // q=offset if there aren't any. While we have some splits left ... while (q > offset && maxsplit-- != 0) { // Delimit the word whose last byte is storage[q-1] // Skip p backwards over the non-whitespace for (p = q; p > offset; --p) { if (isspace(storage[p - 1])) { break; } } // storage[p] is the first byte of the word. (p=offset or storage[p-1] is whitespace.) BaseBytes word = getslice(p - offset, q - offset); result.add(0, word); // Skip q backwards over the whitespace for (q = p; q > offset; --q) { if (!isspace(storage[q - 1])) { break; } } } // Prepend the remaining unsplit text if any if (q > offset) { BaseBytes word = getslice(0, q - offset); result.add(0, word); } return result; } /** * Implementation of Python {@code split()}, that returns a list of the words in the byte array, * using whitespace as the delimiter. See {@link #split(PyObject, int)}. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @return PyList of byte arrays that result from the split */ public PyList split() { return basebytes_split_whitespace(-1); } /** * Implementation of Python {@code split(sep)}, that returns a list of the words in the byte * array, using {@code sep} as the delimiter. See {@link #split(PyObject, int)} for the * semantics of the separator. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @return PyList of byte arrays that result from the split */ public PyList split(PyObject sep) { return basebytes_split(sep, -1); } /** * Implementation of Python {@code split(sep, maxsplit)}, that returns a list of the words in * the byte array, using {@code sep} as the delimiter. If {@code maxsplit} is given, at most * {@code maxsplit} splits are done. (Thus, the list will have at most {@code maxsplit+1} * elements). If {@code maxsplit} is not specified, then there is no limit on the number of * splits (all possible splits are made). *
* If {@code sep} is given, consecutive delimiters are not grouped together and are deemed to * delimit empty strings (for example, {@code '1,,2'.split(',')} returns * {@code ['1', '', '2']}). The {@code sep} argument may consist of multiple characters (for * example, {@code '1<>2<>3'.split('<>')} returns {@code ['1', '2', '3']}). Splitting an empty * string with a specified separator {@code ['']}. *
* If {@code sep} is not specified or is {@code None}, a different splitting algorithm is * applied: runs of consecutive whitespace are regarded as a single separator, and the result * will contain no empty strings at the start or end if the string has leading or trailing * whitespace. Consequently, splitting an empty string or a string consisting of just whitespace * with a {@code None} separator returns {@code []}. For example, *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ public PyList split(PyObject sep, int maxsplit) { return basebytes_split(sep, maxsplit); } /** * Ready-to-expose implementation of Python {@code split(sep, maxsplit)}, that returns a list of * the words in the byte array, using {@code sep} as the delimiter. Use the defines whitespace * semantics if {@code sep} is {@code null}. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ final PyList basebytes_split(PyObject sep, int maxsplit) { if (sep == null || sep == Py.None) { return basebytes_split_whitespace(maxsplit); } else { return basebytes_split_explicit(sep, maxsplit); } } /** * Implementation of Python {@code split(sep, maxsplit)}, that returns a list of the words in * the byte array, using {@code sep} (which is not {@code null}) as the delimiter. If * {@code maxsplit>=0}, at most {@code maxsplit} splits are done (thus, the list will have at * most {@code maxsplit+1} elements). If {@code maxsplit<0}, then there is no limit on the * number of splits (all possible splits are made). *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param sep {@code bytes}, or object viewable as bytes, defining the separator * @param maxsplit maximum number of splits * @return PyList of byte arrays that result from the split */ final synchronized PyList basebytes_split_explicit(PyObject sep, int maxsplit) { // The separator may be presented as anything viewable as bytes try (PyBuffer separator = getViewOrError(sep)) { checkForEmptySeparator(separator); PyList result = new PyList(); // Use the Finder class to search in the storage of this byte array Finder finder = new Finder(separator); finder.setText(this); // Look for the first separator int p = finder.currIndex(); // = offset int q = finder.nextIndex(); // First separator (or <0 if not found) // Note: bytearray().split(' ') == [bytearray(b'')] // While we found a separator, and we have some splits left (if maxsplit started>=0) while (q >= 0 && maxsplit-- != 0) { // Note the Finder works in terms of indexes into this.storage result.append(getslice(p - offset, q - offset)); p = finder.currIndex(); // Start of unsplit text q = finder.nextIndex(); // Next separator (or <0 if not found) } // Append the remaining unsplit text result.append(getslice(p - offset, size)); return result; } } /** * Implementation of Python {@code split(None, maxsplit)}, that returns a list of the words in * the byte array, using whitespace as the delimiter. If {@code maxsplit} is given, at most * maxsplit splits are done (thus, the list will have at most {@code maxsplit+1} elements). If * {@code maxsplit} is not specified, then there is no limit on the number of splits (all * possible splits are made). *
* Runs of consecutive whitespace are regarded as a single separator, and the result will * contain no empty strings at the start or end if the string has leading or trailing * whitespace. Consequently, splitting an empty string or a string consisting of just whitespace * with a {@code None} separator returns []. *
* The elements of the {@code PyList} returned by this method are instances of the same actual
* type as {@code this}.
*
* @param maxsplit maximum number of splits
* @return PyList of byte arrays that result from the split
*/
final synchronized PyList basebytes_split_whitespace(int maxsplit) {
PyList result = new PyList();
int limit = offset + size;
int p, q; // Indexes of unsplit text and whitespace
// Scan over leading whitespace
for (p = offset; p < limit && isspace(storage[p]); p++) {
; // continue
}
// Note: bytearray().split() = bytearray(b' ').split() = []
// At this point if p
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @return List of segments */ public PyList splitlines() { return basebytes_splitlines(false); } /** * Implementation of Python {@code splitlines(keepends)}, returning a list of the lines in the * string, breaking at line boundaries. Line breaks are not included in the resulting list * unless {@code keepends} is true. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param keepends if true, include the end of line bytes(s) * @return PyList of segments */ public PyList splitlines(boolean keepends) { return basebytes_splitlines(keepends); } /** * Ready-to-expose implementation of Python {@code splitlines(keepends)}, returning a list of * the lines in the array, breaking at line boundaries. Line breaks are not included in the * resulting list unless {@code keepends} is given and true. *
* The elements of the {@code PyList} returned by this method are instances of the same actual * type as {@code this}. * * @param keepends if true, include the end of line bytes(s) * @return List of segments */ protected final synchronized PyList basebytes_splitlines(boolean keepends) { PyList list = new PyList(); int limit = offset + size; for (int p = offset; p < limit; /* p advanced in loop */) { int q, lenEOL = 0; // Scan q to the end of the line (or buffer) including the EOL bytes for (q = p; q < limit; q++) { byte b = storage[q]; if (b == '\r') { lenEOL = (storage[q + 1] == '\n') ? 2 : 1; break; } else if (b == '\n') { lenEOL = 1; // Just one EOL byte \n break; } } // lenEOL =2: the line ended \r\n, and q points at \r; // lenEOL =1: the line ended \n or \r (only), and q points at it; // lenEOL =0: the line ended with the end of the data (and q=limit) if (keepends) { list.append(getslice(p - offset, q + lenEOL - offset)); } else { list.append(getslice(p - offset, q - offset)); } // Start next line after what terminated it p = q + lenEOL; } return list; } // // Padding, filling and centering // /** * Helper to check the fill byte for {@link #basebytes_rjust(int, String)}, * {@link #basebytes_ljust(int, String)} and {@link #basebytes_center(int, String)}, which is * required to be a single character string, treated as a byte. * * @param function name * @param fillchar or {@code null} * @return the (checked) single fill byte */ protected static byte fillByteCheck(String function, String fillchar) { if (fillchar == null) { return ' '; } else if (fillchar.length() == 1) { return (byte) fillchar.charAt(0); } else { throw Py.TypeError(function + "() argument 2 must be char, not str"); } } /** * Helper function to construct the return value for {@link #rjust(String)}, * {@link #ljust(String)} and {@link #center(String)}. Clients calculate the left and right fill * values according to their nature, and ignoring the possibility that the desired * {@code width=left+size+right} may be less than {@code this.size}. This method does all the * work, and deals with that exceptional case by returning {@code self[:]}. * * @param pad byte to fill with * @param left padding requested * @param right padding requested * @return (possibly new) byte array containing the result */ private BaseBytes padHelper(byte pad, int left, int right) { if (left + right <= 0) { // Deal here with the case wher width <= size, and no padding is necessary. // If this is immutable getslice may return this same object return getslice(0, size); } // Construct the result in a Builder of the desired width Builder builder = new Builder(left + size + right); builder.repeat(pad, left); builder.append(this); builder.repeat(pad, right); return getResult(builder); } /** * A ready-to-expose implementation of Python {@code center(width [, fillchar])}: return the * bytes centered in an array of length {@code width}. Padding is done using the specified * fillchar (default is a space). A copy of the original byte array is returned if {@code width} * is less than {@code this.size()}. (Immutable subclasses may return exactly the original * object.) * * @param width desired * @param fillchar one-byte String to fill with, or {@code null} implying space * @return (possibly new) byte array containing the result */ final BaseBytes basebytes_center(int width, String fillchar) { // Argument check and default byte pad = fillByteCheck("center", fillchar); // How many pads will I need? int fill = width - size; // CPython uses this formula, which makes a difference when width is odd and size even int left = fill / 2 + (fill & width & 1); return padHelper(pad, left, fill - left); } /** * A ready-to-expose implementation of Python {@code ljust(width [, fillchar])}: return the * bytes left-justified in an array of length {@code width}. Padding is done using the specified * fillchar (default is a space). A copy of the original byte array is returned if {@code width} * is less than {@code this.size()}. (Immutable subclasses may return exactly the original * object.) * * @param width desired * @param fillchar one-byte String to fill with, or {@code null} implying space * @return (possibly new) byte array containing the result */ final BaseBytes basebytes_ljust(int width, String fillchar) { // Argument check and default byte pad = fillByteCheck("rjust", fillchar); // How many pads will I need? int fill = width - size; return padHelper(pad, 0, fill); } /** * A ready-to-expose implementation of Python {@code rjust(width [, fillchar])}: return the * bytes right-justified in an array of length {@code width}. Padding is done using the * specified fillchar (default is a space). A copy of the original byte array is returned if * {@code width} is less than {@code this.size()}. (Immutable subclasses may return exactly the * original object.) * * @param width desired * @param fillchar one-byte String to fill with, or {@code null} implying space * @return (possibly new) byte array containing the result */ final BaseBytes basebytes_rjust(int width, String fillchar) { // Argument check and default byte pad = fillByteCheck("rjust", fillchar); // How many pads will I need? int fill = width - size; return padHelper(pad, fill, 0); } /** * Ready-to-expose implementation of Python {@code expandtabs([tabsize])}: return a copy of the * byte array where all tab characters are replaced by one or more spaces, depending on the * current column and the given tab size. The column number is reset to zero after each newline * occurring in the array. This treats other non-printing characters or escape sequences as * regular characters. *
* The actual class of the returned object is determined by {@link #getResult(Builder)}. * * @param tabsize number of character positions between tab stops * @return copy of this byte array with tabs expanded */ final BaseBytes basebytes_expandtabs(int tabsize) { // We could only work out the true size by doing the work twice, // so make a guess and let the Builder re-size if it's not enough. long estimatedSize = (long) size + size / 8; Builder builder = new Builder(estimatedSize); int carriagePosition = 0; int limit = offset + size; for (int i = offset; i < limit; i++) { byte c = storage[i]; if (c == '\t') { // Number of spaces is 1..tabsize int spaces = tabsize - carriagePosition % tabsize; builder.repeat((byte) ' ', spaces); carriagePosition += spaces; } else { // Transfer the character, but if it is a line break, reset the carriage builder.append(c); carriagePosition = (c == '\n' || c == '\r') ? 0 : carriagePosition + 1; } } return getResult(builder); } /** * Ready-to-expose implementation of Python {@code zfill(width):} return the numeric string left * filled with zeros in a byte array of length width. A sign prefix is handled correctly if it * is in the first byte. A copy of the original is returned if width is less than the current * size of the array. * * @param width desired * @return left-filled byte array */ final BaseBytes basebytes_zfill(int width) { // How many zeros will I need? int fill = width - size; Builder builder = new Builder((fill > 0) ? width : size); if (fill <= 0) { // width <= size so result is just a copy of this array builder.append(this); } else { // At least one zero must be added. Transfer the sign byte (if any) first. int p = 0; if (size > 0) { byte sign = storage[offset]; if (sign == '-' || sign == '+') { builder.append(sign); p = 1; } } // Now insert enough zeros builder.repeat((byte) '0', fill); // And finally the numeric part. Note possibility of no text eg. ''.zfill(6). if (size > p) { builder.append(this, p, size); } } return getResult(builder); } // // Character class operations // // Bit to twiddle (XOR) for lowercase letter to uppercase and vice-versa. private static final int SWAP_CASE = 0x20; // Bit masks and sets to use with the byte classification table private static final byte UPPER = 0b1; private static final byte LOWER = 0b10; private static final byte DIGIT = 0b100; private static final byte SPACE = 0b1000; private static final byte ALPHA = UPPER | LOWER; private static final byte ALNUM = ALPHA | DIGIT; // Character (byte) classification table. private static final byte[] ctype = new byte[256]; static { for (int c = 'A'; c <= 'Z'; c++) { ctype[0x80 + c] = UPPER; ctype[0x80 + SWAP_CASE + c] = LOWER; } for (int c = '0'; c <= '9'; c++) { ctype[0x80 + c] = DIGIT; } for (char c : " \t\n\u000b\f\r".toCharArray()) { ctype[0x80 + c] = SPACE; } } /** @return 'A'<= b <='Z'. */ static final boolean isupper(byte b) { return (ctype[0x80 + b] & UPPER) != 0; } /** @return 'a'<= b <='z'. */ static final boolean islower(byte b) { return (ctype[0x80 + b] & LOWER) != 0; } /** @return 'A'<= b <='Z' or 'a'<= b <='z'. */ static final boolean isalpha(byte b) { return (ctype[0x80 + b] & ALPHA) != 0; } /** @return '0'<= b <='9'. */ static final boolean isdigit(byte b) { return (ctype[0x80 + b] & DIGIT) != 0; } /** @return 'A'<= b <='Z' or 'a'<= b <='z' or '0'<= b <='9'. */ static final boolean isalnum(byte b) { return (ctype[0x80 + b] & ALNUM) != 0; } /** @return b in ' \t\n\v\f\r' */ static final boolean isspace(byte b) { return (ctype[0x80 + b] & SPACE) != 0; } /** * Java API equivalent of Python {@code isalnum()}. This method treats the bytes as US-ASCII * code points. * * @return true if all bytes in the array are code points for alphanumerics and there is at * least one byte, false otherwise. */ public boolean isalnum() { return basebytes_isalnum(); } /** * Ready-to-expose implementation of Python {@code isalnum()}. * * @return true if all bytes in the array are code points for alphanumerics and there is at * least one byte, false otherwise. */ final boolean basebytes_isalnum() { if (size == 1) { // Special case strings of length one (i.e. characters) return isalnum(storage[offset]); } else { // Work through the bytes, stopping early if the test is false. for (int i = 0; i < size; i++) { if (!isalnum(storage[offset + i])) { return false; } } // Result is true if we reached the end (and there were some bytes) return size > 0; } } /** * Java API equivalent of Python {@code isalpha()}. This method treats the bytes as US-ASCII * code points. * * @return true if all bytes in the array are alphabetic and there is at least one byte, false * otherwise */ public boolean isalpha() { return basebytes_isalpha(); } /** * Ready-to-expose implementation of Python {@code isalpha()}. * * @return true if all bytes in the array are alphabetic and there is at least one byte, false * otherwise */ final boolean basebytes_isalpha() { if (size == 1) { // Special case strings of length one (i.e. characters) return isalpha(storage[offset]); } else { // Work through the bytes, stopping early if the test is false. for (int i = 0; i < size; i++) { if (!isalpha(storage[offset + i])) { return false; } } // Result is true if we reached the end (and there were some bytes) return size > 0; } } /** * Java API equivalent of Python {@code isdigit()}. This method treats the bytes as US-ASCII * code points. * * @return true if all bytes in the array are code points for digits and there is at least one * byte, false otherwise. */ public boolean isdigit() { return basebytes_isdigit(); } /** * Ready-to-expose implementation of Python {@code isdigit()}. * * @return true if all bytes in the array are code points for digits and there is at least one * byte, false otherwise. */ final boolean basebytes_isdigit() { if (size == 1) { // Special case strings of length one (i.e. characters) return isdigit(storage[offset]); } else { // Work through the bytes, stopping early if the test is false. for (int i = 0; i < size; i++) { if (!isdigit(storage[offset + i])) { return false; } } // Result is true if we reached the end (and there were some bytes) return size > 0; } } /** * Java API equivalent of Python {@code islower()}. This method treats the bytes as US-ASCII * code points. * * @return true if all cased bytes in the array are code points for lowercase characters and * there is at least one cased byte, false otherwise. */ public boolean islower() { return basebytes_islower(); } /** * Ready-to-expose implementation of Python {@code islower()}. * * @return true if all cased bytes in the array are code points for lowercase characters and * there is at least one cased byte, false otherwise. */ final boolean basebytes_islower() { if (size == 1) { // Special case strings of length one (i.e. characters) return islower(storage[offset]); } else { int i; byte c = 0; // Test the bytes until a cased byte is encountered for (i = 0; i < size; i++) { if (isalpha(c = storage[offset + i])) { break; } } if (i == size || isupper(c)) { // We reached the end without finding a cased byte, or it was upper case. return false; } // Continue to end or until an upper case byte is encountered for (i = i + 1; i < size; i++) { if (isupper(storage[offset + i])) { return false; } } // Found no upper case bytes, and at least one lower case byte. return true; } } /** * Java API equivalent of Python {@code isspace()}. This method treats the bytes as US-ASCII * code points. * * @return true if all the bytes in the array are code points for whitespace characters and * there is at least one byte, false otherwise. */ public boolean isspace() { return basebytes_isspace(); } /** * Ready-to-expose implementation of Python {@code isspace()}. * * @return true if all the bytes in the array are code points for whitespace characters and * there is at least one byte, false otherwise. */ final boolean basebytes_isspace() { if (size == 1) { // Special case strings of length one (i.e. characters) return isspace(storage[offset]); } else { // Work through the bytes, stopping early if the test is false. for (int i = 0; i < size; i++) { if (!isspace(storage[offset + i])) { return false; } } // Result is true if we reached the end (and there were some bytes) return size > 0; } } /** * Java API equivalent of Python {@code istitle()}. This method treats the bytes as US-ASCII * code points. * * @return true if the string is a titlecased string and there is at least one cased byte, for * example uppercase characters may only follow uncased bytes and lowercase characters * only cased ones. Return false otherwise. */ public boolean istitle() { return basebytes_istitle(); } /** * Ready-to-expose implementation of Python {@code istitle()}. * * @return true if the string is a titlecased string and there is at least one cased byte, for * example uppercase characters may only follow uncased bytes and lowercase characters * only cased ones. Return false otherwise. */ final boolean basebytes_istitle() { int state = 0; // 0 = have seen no cased characters (can't be in a word) // 1 = have seen cased character, but am not in a word // 2 = in a word (hence have have seen cased character) for (int i = 0; i < size; i++) { byte c = storage[offset + i]; if (isupper(c)) { if (state == 2) { // Violation: can't continue a word in upper case return false; } else { // Validly in a word state = 2; } } else if (islower(c)) { if (state != 2) { // Violation: can't start a word in lower case return false; } } else { if (state == 2) { // Uncased character: end of the word as we know it state = 1; } } } // Found no case violations, but did we find any cased bytes at all? return state != 0; } /** * Java API equivalent of Python {@code isupper()}. This method treats the bytes as US-ASCII * code points. * * @return true if all cased bytes in the array are code points for uppercase characters and * there is at least one cased byte, false otherwise. */ public boolean isupper() { return basebytes_isupper(); } /** * Ready-to-expose implementation of Python {@code isupper()}. * * @return true if all cased bytes in the array are code points for uppercase characters and * there is at least one cased byte, false otherwise. */ final boolean basebytes_isupper() { if (size == 1) { // Special case strings of length one (i.e. characters) return isupper(storage[offset]); } else { int i; byte c = 0; // Test the bytes until a cased byte is encountered for (i = 0; i < size; i++) { if (isalpha(c = storage[offset + i])) { break; } } if (i == size || islower(c)) { // We reached the end without finding a cased byte, or it was lower case. return false; } // Continue to end or until a lower case byte is encountered for (i = i + 1; i < size; i++) { if (islower(storage[offset + i])) { return false; } } // Found no lower case bytes, and at least one upper case byte. return true; } } // // Case transformations // /** * Java API equivalent of Python {@code capitalize()}. This method treats the bytes as US-ASCII * code points. The {@code BaseBytes} returned by this method has the same actual type as * {@code this/self}. * * @return a copy of the array with its first character capitalized and the rest lowercased. */ public BaseBytes capitalize() { return basebytes_capitalize(); } /** * Ready-to-expose implementation of Python {@code capitalize()}. The {@code BaseBytes} returned * by this method has the same actual type as {@code this/self}. * * @return a copy of the array with its first character capitalized and the rest lowercased. */ final BaseBytes basebytes_capitalize() { Builder builder = new Builder(size); if (size > 0) { // Treat first character byte c = storage[offset]; if (islower(c)) { c ^= SWAP_CASE; // 'a' -> 'A', etc. } // Put the adjusted character in the output as a byte builder.append(c); // Treat the rest for (int i = 1; i < size; i++) { c = storage[offset + i]; if (isupper(c)) { c ^= SWAP_CASE; // 'A' -> 'a', etc. } // Put the adjusted character in the output as a byte builder.append(c); } } return getResult(builder); } /** * Java API equivalent of Python {@code lower()}. This method treats the bytes as US-ASCII code * points. The {@code BaseBytes} returned by this method has the same actual type as * {@code this/self}. * * @return a copy of the array with all the cased characters converted to lowercase. */ public BaseBytes lower() { return basebytes_lower(); } /** * Ready-to-expose implementation of Python {@code lower()}. The {@code BaseBytes} returned by * this method has the same actual type as {@code this/self}. * * @return a copy of the array with all the cased characters converted to lowercase. */ final BaseBytes basebytes_lower() { Builder builder = new Builder(size); for (int i = 0; i < size; i++) { byte c = storage[offset + i]; if (isupper(c)) { c ^= SWAP_CASE; // 'A' -> 'a', etc. } // Put the adjusted character in the output as a byte builder.append(c); } return getResult(builder); } /** * Java API equivalent of Python {@code swapcase()}. This method treats the bytes as US-ASCII * code points. The {@code BaseBytes} returned by this method has the same actual type as * {@code this/self}. * * @return a copy of the array with uppercase characters converted to lowercase and vice versa. */ public BaseBytes swapcase() { return basebytes_swapcase(); } /** * Ready-to-expose implementation of Python {@code swapcase()}. The {@code BaseBytes} returned * by this method has the same actual type as {@code this/self}. * * @return a copy of the array with uppercase characters converted to lowercase and vice versa. */ final BaseBytes basebytes_swapcase() { Builder builder = new Builder(size); for (int i = 0; i < size; i++) { byte c = storage[offset + i]; if (isalpha(c)) { c ^= SWAP_CASE; // 'a' -> 'A', 'A' -> 'a', etc. } // Put the adjusted character in the output as a byte builder.append(c); } return getResult(builder); } /** * Java API equivalent of Python {@code title()}. The algorithm uses a simple * language-independent definition of a word as groups of consecutive letters. The definition * works in many contexts but it means that apostrophes in contractions and possessives form * word boundaries, which may not be the desired result. The {@code BaseBytes} returned by this * method has the same actual type as {@code this/self}. * * @return a titlecased version of the array where words start with an uppercase character and * the remaining characters are lowercase. */ public BaseBytes title() { return basebytes_title(); } /** * Ready-to-expose implementation of Python {@code title()}. The {@code BaseBytes} returned by * this method has the same actual type as {@code this/self}. * * @return a titlecased version of the array where words start with an uppercase character and * the remaining characters are lowercase. */ final BaseBytes basebytes_title() { Builder builder = new Builder(size); boolean inWord = false; // We begin, not in a word (sequence of cased characters) for (int i = 0; i < size; i++) { byte c = storage[offset + i]; if (!inWord) { // When we are not in a word ... if (islower(c)) { c ^= SWAP_CASE; // ... a lowercase letter must be upcased inWord = true; // and it starts a word. } else if (isupper(c)) { inWord = true; // ... an uppercase letter just starts the word } } else { // When we are in a word ... if (isupper(c)) { c ^= SWAP_CASE; // ... an uppercase letter must be downcased } else if (!islower(c)) { inWord = false; // ... and a non-letter ends the word } } // Put the adjusted character in the output as a byte builder.append(c); } return getResult(builder); } /** * Java API equivalent of Python {@code upper()}. Note that {@code x.upper().isupper()} might be * {@code false} if the array contains uncased characters. The {@code BaseBytes} returned by * this method has the same actual type as {@code this/self}. * * @return a copy of the array with all the cased characters converted to uppercase. */ public BaseBytes upper() { return basebytes_upper(); } /** * Ready-to-expose implementation of Python {@code upper()}. The {@code BaseBytes} returned by * this method has the same actual type as {@code this/self}. * * @return a copy of the array with all the cased characters converted to uppercase. */ final BaseBytes basebytes_upper() { Builder builder = new Builder(size); for (int i = 0; i < size; i++) { byte c = storage[offset + i]; if (islower(c)) { c ^= SWAP_CASE; // 'a' -> 'A' etc. } // Put the adjusted character in the output as a byte builder.append(c); } return getResult(builder); } /* * ============================================================================================ * Java API for access as byte[] * ============================================================================================ * * Just the immutable case for now */ /** * No range check access to byte[index]. * * @param index * @return the byte at the given index */ private final synchronized byte byteAt(int index) { return storage[index + offset]; } /** * Return the Python byte (in range 0 to 255 inclusive) at the given index. * * @param index of value in byte array * @return the integer value at the index * @throws PyException {@code IndexError} if the index is outside the array bounds */ public synchronized int intAt(int index) throws PyException { indexCheck(index); return 0xff & byteAt(index); } // // str() and repr() have different behaviour (despite PEP 3137) // /** * Helper for __repr__() * * @param buf destination for characters * @param c curren (maybe unprintable) character */ private static final void appendHexEscape(StringBuilder buf, int c) { buf.append("\\x").append(Character.forDigit((c & 0xf0) >> 4, 16)) .append(Character.forDigit(c & 0xf, 16)); } /** * Almost ready-to-expose Python {@code __repr__()}, based on treating the bytes as point codes. * The value added by this method is conversion of non-printing code points to hexadecimal * escapes in printable ASCII, and bracketed by the given before and after strings. These are * used to get the required presentation: * *
* bytearray(b'Hello world!')
*
*
* with the possibility that subclasses might choose something different.
*
* @param before String to insert before the quoted text
* @param after String to insert after the quoted text
* @return string representation: {@code before + "'" + String(this) + "'" + after}
*/
final synchronized String basebytes_repr(String before, String after) {
// Safety
if (before == null) {
before = "";
}
if (after == null) {
after = "";
}
// Guess how long the result might be
int guess = size + (size >> 2) + before.length() + after.length() + 10;
StringBuilder buf = new StringBuilder(guess);
buf.append(before).append('\'');
// Scan and translate the bytes of the array
int jmax = offset + size;
for (int j = offset; j < jmax; j++) {
int c = 0xff & storage[j];
if (c >= 0x7f) { // Unprintable high 128 and DEL
appendHexEscape(buf, c);
} else if (c >= ' ') { // Printable
if (c == '\\' || c == '\'') { // Special cases
buf.append('\\');
}
buf.append((char) c);
} else if (c == '\t') { // Special cases in the low 32
buf.append("\\t");
} else if (c == '\n') {
buf.append("\\n");
} else if (c == '\r') {
buf.append("\\r");
} else {
appendHexEscape(buf, c);
}
}
buf.append('\'').append(after);
return buf.toString();
}
/*
* ============================================================================================
* API for java.util.List* It is intended the client call this method only once to get the result of a series of * append operations. A second call to {@link #getStorage()}, before any further appending, * returns a zero-length array. This is to ensure that the same array is not given out * twice. * * @return an array containing the accumulated result */ byte[] getStorage() { byte[] s = storage; storage = emptyStorage; return s; } /** * Number of bytes accumulated. In conjunction with {@link #getStorage()}, this provides the * result. Unlike {@link #getStorage()}, it does not affect the contents of the builder. * * @return number of bytes accumulated */ final int getSize() { return size; } /** * Append a single byte to the value. * * @param b */ void append(byte b) { makeRoomFor(1); storage[size++] = b; } /** * Append a number of repeats of a single byte to the value, for example in padding. * * @param b byte to repeat * @param n number of repeats (none if n<=0) */ void repeat(byte b, int n) { if (n > 0) { makeRoomFor(n); while (--n >= 0) { storage[size++] = b; } } } /** * Append the contents of the given byte array. * * @param b */ void append(BaseBytes b) { append(b, 0, b.size); } /** * Repeat the contents of the given byte array. * * @param b */ void repeat(BaseBytes b, int n) { repeat(b, 0, b.size, n); } /** * Append the contents of a slice of the given byte array. * * @param b * @param start index of first byte copied * @param end index of first byte not copied */ void append(BaseBytes b, int start, int end) { int len = end - start; makeRoomFor(len); System.arraycopy(b.storage, b.offset + start, storage, size, len); size += len; } /** * Repeat the contents of a slice of the given byte array. * * @param b * @param start index of first byte copied * @param end index of first byte not copied * @param n number of repetitions */ void repeat(BaseBytes b, int start, int end, int n) { int len = end - start; makeRoomFor(len * (long) n); start += b.offset; while (--n >= 0) { System.arraycopy(b.storage, start, storage, size, len); size += len; } } /** * Append the contents of the given {@link PyBuffer}. * * @param b */ void append(PyBuffer v) { int n = v.getLen(); makeRoomFor(n); v.copyTo(storage, size); size += n; } /** * Ensure there is room for an additional {@code n} bytes, if necessary allocating a new * {@code byte[]} and copying the contents. Trap and convert to {@code PyException} any * overflow. * * @param n additional capacity requested ({@code long} so we can recognise overflow). * @throws PyException {@code OverflowError} when {@code sys.maxsize} is exceeded. * @throws PyException {@code MemoryError} when free heap or JVM limitation is exceeded. */ final void makeRoomFor(long n) throws PyException { long needed = size + n; if (needed > storage.length) { if (size > 0 && storage == emptyStorage) { // Special case where append comes after a getStorage(). size = 0; needed = n; } // Guardedly allocate the needed amount (or a rounded-up amount) if (needed > PySystemState.maxsize) { throw Py.OverflowError("max bytes len is " + PySystemState.maxsize); } else if (needed <= 0) { storage = emptyStorage; } else { try { if (size == 0) { // Just a new array storage = new byte[(int) needed]; } else { // New array preserving existing contents byte[] existing = storage; storage = new byte[roundUp((int) needed)]; System.arraycopy(existing, 0, storage, 0, size); } } catch (OutOfMemoryError e) { // Exceeded the available heap or the limits of this JVM. throw Py.MemoryError(e.getMessage()); } } } } } /** * Choose a size appropriate to store the given number of bytes, with some room for growth, when * allocating storage for mutable types or {@code Builder}. We'll be more generous than CPython * for small array sizes to avoid needless reallocation. * * @param size of storage actually needed * @return n ≥ size a recommended storage array size */ protected static final int roundUp(int size) { /* * The CPython formula is: size + (size >> 3) + (size < 9 ? 3 : 6). But when the array * grows, CPython can use a realloc(), which will often be able to extend the allocation * into memory already secretly allocated by the initial malloc(). Extension in Java means * that we have to allocate a new array of bytes and copy to it. */ final int ALLOC = 16; // Must be power of two! final int SIZE2 = 10; // Smallest size leading to a return value of 2*ALLOC if (size >= SIZE2) { // Result > ALLOC, so work it out // Same recommendation as CPython, but rounded up to multiple of ALLOC return (size + (size >> 3) + (6 + ALLOC - 1)) & ~(ALLOC - 1); } else if (size > 0) { // Easy: save arithmetic return ALLOC; } else { // Very easy return 0; } } }