Don’t parse the pipeline as text when it is directed from an EXE to another EXE or file. Keep the bytes as-is. #1908

Open
be5invis opened this issue Aug 18, 2016 · 31 comments
Labels: Issue-Bug WG-Engine WG-Engine-Performance


@be5invis be5invis commented Aug 18, 2016

Currently PowerShell parses STDOUT as a string when piping from an EXE, while in some cases it should be preserved as a byte stream, as in this scenario:

curl.exe http://whatever/a.png > a.png

or

node a.js | gzip -c > out.gz

Affected patterns include: native | native, native > file and (maybe) cat file | native.

@be5invis be5invis changed the title from “Don” to “Don‘t parse the pipeline when it is redirected from an EXE to another EXE. Keep the bytes as-is.” Aug 18, 2016
@be5invis be5invis changed the title from “Don‘t parse the pipeline when it is redirected from an EXE to another EXE. Keep the bytes as-is.” to “Don‘t parse the pipeline when it is redirected from an EXE to another EXE or file saving. Keep the bytes as-is.” Aug 18, 2016
@lzybkr lzybkr added Issue-Bug WG-Engine labels Aug 18, 2016
Author

@be5invis be5invis commented Aug 18, 2016

@vors @lzybkr
The current NativeCommandProcessor breaks:

  • LF line endings.
  • Non-ASCII text in UTF-8 without a BOM.
  • Binary file redirects (like curl.exe’s output).
  • > lays out text in 80 columns by default.
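
The first three failure modes listed above can be reproduced outside PowerShell. The following is a minimal Python sketch of a "decode, split into lines, re-emit" round trip. It is only a simulation of the effect, not PowerShell's actual NativeCommandProcessor logic; the function name and the choice of cp437 as a stand-in OEM code page are assumptions made for illustration.

```python
def simulate_text_roundtrip(data, decode_as="utf-8", encode_as="utf-8"):
    """Treat raw bytes as text lines and write them back out.

    Hypothetical helper: it mimics a shell that decodes native output,
    splits it into line strings, and re-emits each line with CRLF.
    """
    text = data.decode(decode_as, errors="replace")
    lines = text.replace("\r\n", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another line
    return "".join(line + "\r\n" for line in lines).encode(encode_as)


if __name__ == "__main__":
    # LF line endings are rewritten as CRLF.
    print(simulate_text_roundtrip(b"a\nb\n"))  # b'a\r\nb\r\n'

    # UTF-8 decoded with the wrong code page no longer round-trips.
    utf8 = "é\n".encode("utf-8")
    print(simulate_text_roundtrip(utf8, decode_as="cp437") == utf8)  # False

    # The PNG signature starts with a byte that is not valid UTF-8.
    png = b"\x89PNG\r\n\x1a\n"
    print(simulate_text_roundtrip(png) == png)  # False
```

Any step that passes through a string has to pick an encoding and a line-ending convention, which is why binary output cannot survive it unchanged.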

@be5invis be5invis changed the title from “Don‘t parse the pipeline when it is redirected from an EXE to another EXE or file saving. Keep the bytes as-is.” to “Don’t parse the pipeline as text when it is directed from an EXE to another EXE or file. Keep the bytes as-is.” Aug 18, 2016

@ForNeVeR ForNeVeR commented Aug 25, 2016

Maybe add a cmdlet/operator to call a native command and get its raw output (as a byte array / stream?), something like this:

# Consider ^& operator is an alias for Get-CommandRawOutputStream; this is just an example syntax
$output = ^& curl.exe http://whatever/a.png # $output now is a byte array or stream
$output > C:\Temp\file.png # file.png now is a valid image file

# This should be valid, too:
^& curl.exe http://whatever/a.png > C:\Temp\file.png

This opens an opportunity for some additional usage patterns (you can put this raw content into variables, and pipe raw content from native commands to managed cmdlets).


@ForNeVeR ForNeVeR commented Aug 25, 2016

But maybe we could add a special kind of redirection operator (like 2>&1, 3>&1, *>&1 we already have), something like this (where %>&1 is a new redirection operator that redirects command "raw output" without processing it as a string):

$output = curl.exe http://whatever/a.png %>&1
$output > C:\Temp\file.png

# Or even this:
curl.exe http://whatever/a.png %> C:\Temp\file.png # which is just awesome

Overall: I don't think that this kind of redirection should be tied to only native commands or some limited list of usage patterns (e.g. native | native).

Author

@be5invis be5invis commented Aug 25, 2016

@ForNeVeR My proposal is that:

  1. For native | native, keep the bytes as-is. This is already proposed by @vors.
  2. For ps | native, add a set of cmdlets which encode PS objects into bytes, perhaps ps | encode-text utf-8 | native.
  3. For native | ps, we can use the type system to identify whether a cmdlet accepts “raw input”. Cmdlets like out-file or maybe decode-text will receive the bytes from native unchanged, and other cmdlets will use the parsed strings as their input.
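
Point 3 of the proposal above amounts to a dispatch on what the downstream consumer declares. A rough Python sketch of that idea follows; the `Consumer` type, the `accepts_raw` flag and the `deliver` function are hypothetical names invented for this illustration and do not correspond to any PowerShell API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Consumer:
    """Stand-in for a cmdlet; accepts_raw models the 'raw input' marker."""
    name: str
    accepts_raw: bool
    handler: Callable


def deliver(native_output: bytes, consumer: Consumer, encoding: str = "utf-8"):
    """Hand native output to a consumer as bytes or as decoded lines."""
    if consumer.accepts_raw:
        # e.g. an out-file style consumer: bytes pass through untouched
        return consumer.handler(native_output)
    # Every other consumer keeps today's behaviour: a list of strings
    lines = native_output.decode(encoding, errors="replace").splitlines()
    return consumer.handler(lines)


if __name__ == "__main__":
    out_file = Consumer("out-file", True, lambda payload: payload)
    sort_object = Consumer("sort-object", False, sorted)
    print(deliver(b"\x89PNG\r\n", out_file))  # bytes are unchanged
    print(deliver(b"b\na\n", sort_object))    # ['a', 'b']
```

The point of the design is that existing text-oriented consumers see no change, while only consumers that opt in ever receive bytes.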


@ForNeVeR ForNeVeR commented Aug 26, 2016

@be5invis okay, it seems like this proposal also supports all the relevant use cases I can imagine.


@GeeLaw GeeLaw commented Sep 2, 2016

Shouldn't this open up an RFC since this is a breaking change (changes the observed behaviour)?

A workaround for this is to provide a cmdlet that stores the content in a temporary file. A working example is Use-RawPipeline in PowerShell Gallery. The current implementation is to store the file, but it could also be streamlined so that the file doesn't have to be stored.


@jhclark jhclark commented Sep 6, 2016

See also #559, where this appears to be actively discussed and worked on by @vors on the PowerShell team.

@vors vors self-assigned this Sep 12, 2016
@vors vors added this to the 6.0.0 milestone Sep 12, 2016
@vors vors added the WG-Engine-Performance label Sep 12, 2016
Collaborator

@vors vors commented Sep 12, 2016

Great discussion! Thank you all for the feedback.

I'd like to share my plans about this work:

  • In the scope of this issue we will address only the native | native and native > file behavior. Note that although this could be seen as a breaking change, that would not be the case for text output: the behavior would be preserved. Byte output would be much more reliable without wrapping bytes in PS strings. We agreed with @lzybkr that it's not breaking, hence no RFC process would be applied.
  • I don't see an immediate need to enhance the native | ps case, since PS can consume only strings from native commands. However, somebody may want to write a function like
function foo
{
  param([byte[]]$rawBytes)
}

They may achieve it with a temp file or some other technique, as @GeeLaw pointed out.

  • Similarly, the ps | native case has a well-established pattern: when PS objects need to be passed to the native command, we apply an implicit Out-String and pass everything as text.
    Because PS doesn't use byte streams as a primitive for the pipeline, I don't think we should develop special sugar to support it in the language directly. If there is a case where it needs to be done, similar workarounds can be used.

We can revisit the last two parts later, but I'd like to set expectations about the scope of this issue.
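
The temp-file technique mentioned above can be illustrated in a few lines. This is a hedged Python sketch, not the implementation of Use-RawPipeline or of anything in PowerShell: the native process writes straight to a file handle owned by the operating system, and the caller reads the bytes back afterwards. The function name is hypothetical, and the child process is a Python one-liner only so that the example is self-contained.

```python
import os
import subprocess
import sys
import tempfile


def capture_raw_via_tempfile(argv):
    """Run a command with stdout attached to a temp file; return its bytes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            # The child writes to the file directly; nothing is decoded.
            subprocess.run(argv, stdout=out, check=True)
        with open(path, "rb") as saved:
            return saved.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Stand-in "native" command that emits every possible byte value once.
    emit_all_bytes = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(bytes(range(256)))",
    ]
    data = capture_raw_via_tempfile(emit_all_bytes)
    print(data == bytes(range(256)))  # True
```

The cost is an extra trip through the file system, which is the overhead a streamed implementation would avoid.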

Author

@be5invis be5invis commented Sep 12, 2016

@vors However, the current “>” is identical to out-file, so you would have to add a special version of out-file which takes raw bytes. So why don’t you give that ability to everyone?

Member

@daxian-dbw daxian-dbw commented Jun 24, 2021

Re-assigned to @JamesWTruher, who will work on this in the 7.3 development cycle.


@AE1020 AE1020 commented Nov 24, 2021

I am interested to know how many people who want the "as-is" bytes are also in the camp of people who use LF instead of CR LF in their Windows PowerShell scripts...

I've proposed a mode that guides behavior based on detection of LF vs. CR LF of the script containing the pipe/redirect:

Option for LF vs CR LF Piping To Match Line Endings of Running Script #16511

It may be (?) that someone's feelings about the importance of "as-is" redirection vs CR LF is effectively captured by their git autocrlf setting. So this would piggyback on that.


@UberKluger UberKluger commented Dec 26, 2021

Perhaps a useful implementation would recognise the historical significance of a [byte] stream as it pertains to native applications. Three scenarios would need to be considered:

  1. Whenever a stream of actual [byte]s is passed into a pipeline to a native program (e.g. Get-Content -AsByteStream somefile | native) then no Out-String conversion is applied, the bytes are streamed as-is. It seems unlikely that many native apps would be expecting a sequence of decimal representations of the value of the byte, one per line, and this could easily be created by simply converting the [byte] stream into an [int] stream (which would then still use Out-String). Of course, there would be no practical way for PS to know that a stream consists entirely of [byte]s unless the (single) object passed were a [byte[]] (which would then be enumerated into a [byte] stream). It would be a matter for debate as to whether only this particular case would trigger the special behaviour or any stream with an initial [byte] would trigger it but then throw a terminating error if a non-[byte] were passed (similar to Set-Content -AsByteStream which has the same issue when taking input [object]s from the pipeline).

  2. For the case where a native program is the pipeline input to a cmdlet (e.g. native | Sort-Object), the current behaviour is retained since often (if not mostly) the native program's output will be (typ. ASCII) text. Bytes are converted to unicode characters by whatever mapping is appropriate and then collected into [string]s which are passed to the cmdlet. However, similar to the way commandline arguments are parsed into objects but the original text is retained in case a native command is being invoked, each [string] object would be wrapped in a [psobject] (or just have an added property) which would contain the original [byte]s received from the native program (maybe a [byte[]] or a [string] with the original [byte]s collected but not mapped to unicode equivalents). This would be invisible to existing cmdlets (possibly directly accessible if desired via a public property) but a new filter cmdlet (ConvertTo-ByteStream or cbs perhaps) would be provided which would restore the original output into a [byte] stream which could then be passed into the remaining pipeline. This would also cover the case of saving native output into a file by simply doing native | ConvertTo-ByteStream | Set-Content -AsByteStream (the behaviour of > would be unaltered to minimise potential breaking changes). Of course, for native programs that produced raw binary data (i.e. not character strings of any form), very big (but unused) [string]s could be produced, potentially impacting upon both processor and memory usage. Whether some limit would be imposed on the size of [string]s built from native output is a topic for debate. Any such limit would not affect true [byte] streams, just the form of [string]s built from that stream which might terminate before an "end of line" (a meaningless term for raw binary data).

  3. For native | native ( | native ...), I would suggest that the basic ethos of the PS pipeline (passing objects) is totally inappropriate as such programs are not suited for (nor even aware of) the PS pipeline. For this case (only), the original cmd (and Unix™) behaviour should be restored. Each program would be started in its own process (as currently) with a (Windows/Unix™) anonymous byte stream pipe connecting StdOut of each to StdIn of the next (if any). Any sequence of native | native (| native ...) within a larger pipeline would be treated as a single native program with one input stream from the PS pipeline and one output stream back to the PS pipeline as per items 1 and 2 above, e.g.

    Get-Content -AsByteStream somefile | native | native | native | Sort-Object | more
    

    would be treated as

    Get-Content -AsByteStream somefile | ( one native doing 3 things ) | Sort-Object | more
    

    While this might seem to be the breakingest change possible, I suspect that the vast majority of native programs (with input / output suitable for piping) would have been designed around this behaviour (byte stream pipes, the type obtained from kernel APIs). Certainly, anything intended for invocation by cmd would expect it. In particular, find.exe, findstr.exe and sort.exe do (and also don't like Unicode). Other programs that might have the ability to process Unicode would either utilise a command line option, a BOM (LE / BE) as the first two bytes read or (less predictable) DBCS heuristics, e.g. expecting a lot of alternating 0 bytes for mostly ASCII characters or maybe a LE/BE-Unicode Space, TAB or new-line within the first 100 byte pairs (I'm looking at you, Scripting.FileSystemObject), but they would still read a byte stream (as paired bytes). Further on the "make it work like cmd/Unix™", any redirections of file handles within a native pipeline grouping (native | native | ...) would need to operate on the actual process handles as there would be no "PS pipeline" within the grouping, e.g.

    Get-Content -AsByteStream infile | native1 2> native1.err | native2 2> native2.err | ConvertTo-ByteStream | Set-Content -AsByteStream outfile
    

    Here, native1.err and native2.err are connected directly to file handle 2 of native1 and native2, respectively. Alternately,

    Get-Content -AsByteStream infile | native1 2>&1 | native2 2>&1 | ConvertTo-ByteStream | Set-Content -AsByteStream outfile
    

    In this case, streams are merged with file handle 2 of both native1 and native2 being duplicated from each's file handle 1 (their respective anonymous output pipes). Whether (on Unix™ like systems) being able to open/redirect/merge file handles other than 2 (à la sh(1) and derivatives) would be desirable (feasible?) within a native pipeline grouping is less clear. There would be few, if any, native programs (on any system) that expected anything beyond StdIn, StdOut and StdErr.

I believe that the preceding could go a long way toward resolving the apparent discontinuity between (legacy) program simple byte oriented CLI I/O and the more powerful but more complicated object oriented PS pipeline. While some potentially breaking changes are involved, I suspect these would mostly affect strategies (kludges?) used to work around the incompatibility issues addressed here. Further, the changes proposed allow original (incompatible) behaviour to be maintained at the PS script level. For native output, no change will occur unless the new ConvertTo-ByteStream cmdlet is used. For input, simply changing a [byte] stream to an [int] stream (of the same values) will restore previous behaviour. For native to native pipes, interposing an explicit Out-String between each native in the pipeline should restore previous behaviour by passing the data through PS, with the associated conversions (not 100% sure about this one).
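
Scenario 3 above (a run of native | native stages joined by anonymous OS pipes, with one input and one output at the edges) is what process-spawning APIs in most languages already do. The following Python sketch shows the shape of it under stated assumptions: `run_native_pipeline` is a hypothetical name, and the two stages are Python one-liners standing in for programs such as gzip so that the example does not depend on what is installed.

```python
import subprocess
import sys
import threading


def run_native_pipeline(stages, input_bytes):
    """Chain commands with OS pipes; bytes never pass through a string."""
    procs = []
    upstream = subprocess.PIPE  # the first stage reads from the caller
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            # The child now owns the read end; drop the parent's copy.
            procs[-1].stdout.close()
        procs.append(proc)
        upstream = proc.stdout

    def feed():
        # Write on a separate thread so reading the output cannot deadlock.
        with procs[0].stdin as stdin:
            stdin.write(input_bytes)

    writer = threading.Thread(target=feed)
    writer.start()
    output = procs[-1].stdout.read()
    writer.join()
    for proc in procs:
        proc.wait()
    return output


if __name__ == "__main__":
    compress = [sys.executable, "-c",
                "import gzip, sys; "
                "sys.stdout.buffer.write(gzip.compress(sys.stdin.buffer.read()))"]
    decompress = [sys.executable, "-c",
                  "import gzip, sys; "
                  "sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))"]
    payload = bytes(range(256)) * 64
    print(run_native_pipeline([compress, decompress], payload) == payload)  # True
```

Only the first stage's input and the last stage's output are visible to the caller, which matches the "one native doing 3 things" picture in the comment.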


@huettenhain huettenhain commented Jan 22, 2022

One way to avoid any magic would be that PowerShell inspects a list of exceptions that the user can configure themselves. It could be as simple as a JSON file stored at a specific location in your user profile:

{
    "exceptions": [
        {
            "path": "C:\\Python39\\python.exe",
            "stdin": "native",
            "stdout": "native",
            "stderr": "native"
        }
    ]
}

Obviously, the exact format and location of the configuration isn't so relevant, but this would make it an easy to configure opt-in feature that does not disrupt the way PowerShell works by default.
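
To make the opt-in idea concrete, here is a small Python sketch of how such a file might be consulted. Everything beyond the JSON shape shown in the comment is an assumption: the `stream_modes` function, the `"text"` default for unlisted programs, and the case-insensitive Windows path comparison are illustrative choices, not part of any PowerShell design.

```python
import json
import ntpath

# Same shape as the sample above; strict JSON allows no trailing comma.
CONFIG_TEXT = r"""
{
    "exceptions": [
        {
            "path": "C:\\Python39\\python.exe",
            "stdin": "native",
            "stdout": "native",
            "stderr": "native"
        }
    ]
}
"""

STREAMS = ("stdin", "stdout", "stderr")


def stream_modes(config, exe_path):
    """Return the configured mode per stream, or "text" when unlisted."""
    wanted = ntpath.normcase(exe_path)  # Windows paths: ignore case and slash style
    for entry in config.get("exceptions", []):
        if ntpath.normcase(entry["path"]) == wanted:
            return {name: entry.get(name, "text") for name in STREAMS}
    return {name: "text" for name in STREAMS}


config = json.loads(CONFIG_TEXT)

if __name__ == "__main__":
    print(stream_modes(config, "c:/python39/PYTHON.EXE"))
    print(stream_modes(config, r"C:\Windows\System32\sort.exe"))
```

Defaulting unlisted executables to the existing text behaviour is what keeps the feature opt-in.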

Collaborator

@iSazonov iSazonov commented Jan 24, 2022

One way to avoid any magic would be that PowerShell inspects a list of exceptions

General proposal is in #13428

Contributor

@hez2010 hez2010 commented Mar 7, 2022

Any updates? It seems that without this we cannot use gzip to compress a data stream:

cat -AsByteStream a.js | gzip > a.gz

The data stream will be corrupted by the >.

This command works on every shell except PowerShell.

Collaborator

@vexx32 vexx32 commented Mar 7, 2022

@JamesWTruher is this still on your radar?

Contributor

@hez2010 hez2010 commented May 5, 2022

Can we expect this to be committed in 7.3? Whenever I want to write bytes to native executables via the pipeline, I have to launch a cmd or bash shell to achieve this, which makes PowerShell really hard to use, and useless in scripts where processing binary data with native executables is common.


@lewis-yeung lewis-yeung commented May 24, 2022

This has annoyed me for a long time. 💢 I believe there are more PS users feeling confused about it. However it's so frustrating that PR #15861 was closed. It's 2022 now, and unfortunately we still cannot get rid of the object-based stream redirection for native executables. 😭

@vexx32 vexx32 assigned JamesWTruher and unassigned rjmholt May 24, 2022

@mitchcapper mitchcapper commented May 24, 2022

lewis-yeung, it can be quite the pain. The workaround of https://github.com/GeeLaw/PowerShellThingies/tree/master/modules/Use-RawPipeline requires rewriting your commands but, I have found, performs well in many situations.
