I've also been writing Python versions of some other Linux commands, and will post about them here some time later.
Also, for saving (and then viewing) man pages (about commands or other topics), without the formatting characters (such as ^H), on some Linux distros, I find this useful:
m, a Unix shell utility to save cleaned-up man pages as text:
> Also, for saving (and then viewing) man pages (about commands or other topics), without the formatting characters (such as ^H), on some Linux distros, I find this useful:
^H is one of the two backspace characters (the other being ^?), depending on which VT standard you're configured to use. So you shouldn't really need a dedicated script to do that, as it's just a problem between your pseudo-TTY (Linux console) and your terminal emulator (PuTTY / iTerm / xterm / etc.).
Thankfully both are easily configurable. However, without knowing which terminal emulator you're using, I couldn't walk you through configuring that. As for the TTY, you can change the backspace character via the `stty` command:
stty erase [hit backspace on your keyboard]
Another thought: it might be that the TTY is largely configured correctly already, but you're overwriting your environment variables with non-standard values (e.g. setting $TERM to a value that differs from the terminal you're actually using), which causes the pager (`less`, `more`, etc., which `man` uses for paging) to break the standard your terminal is expecting. Either way, this is definitely a configuration problem rather than something that should be fixed with additional parsing scripts.
> you shouldn't really need a dedicated script to do that
It isn't because the terminal is mis-configured and/or mis-interpreting those characters; it's because these characters have historically been used in a special way by troff's ASCII output driver, as used by the man command.
Bold characters are "emulated" by outputting the character itself, followed by ^H, then by repeating the character, and underline is emulated by outputting _ (the underscore character) followed by ^H, followed by the character to be underlined. This is the same way that bold/underline was achieved on a manual typewriter or by old teletype terminals with paper output.
Pagers like "more" & "less" have in-built behaviour that knows how to interpret these sequences and render bold or underline appropriately, and if you "cat" the file your terminal would probably ignore them, but if you open a file with those sequences in a text editor, you're going to see a bunch of unnecessary ^H characters. The OP's script uses the "col" command to remove the unnecessary ^Hs (and the preceding character) that troff has output.
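To make the clean-up concrete, here is a minimal sketch of what stripping those sequences amounts to. It is not the OP's script: it swaps in sed for col, and the function name strip_overstrike is made up for illustration. It simply deletes every "character followed by backspace" pair, which covers both the bold (X^HX) and underline (_^HX) cases described above.

```shell
# strip_overstrike (hypothetical helper): read text on stdin and delete
# every "any character + backspace" pair, leaving the plain character.
strip_overstrike() {
    bs=$(printf '\b')        # a literal backspace (^H) character
    sed "s/.${bs}//g"
}

# Sample input: "NAME" in overstruck bold, then "ls" underlined.
printf 'N\bNA\bAM\bME\bE _\bl_\bs\n' | strip_overstrike
```

On the sample line this should leave just "NAME ls". Real man output can also contain other control sequences, which is what col is designed to handle, so treat the sed version as an illustration only.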
By default, GNU groff doesn't actually output those sequences anymore and uses ANSI escape codes instead. AFAIK many (most?) distributions actually compile out the ANSI behaviour in favour of the old way though (because "more" and "less" don't actually behave correctly with the ANSI characters by default), but some don't, which breaks the script (it's broken in Cygwin, for instance).
FWIW, if you need a fool-proof way to convert a man page to plain old ASCII with no escapes at all, it's easiest just to redirect the output of "man" to a file:
man ls > ls.txt
The long-winded way (with groff) is something like:
>Pagers like "more" & "less" have in-built behaviour that knows how to interpret these sequences and render bold or underline appropriately,
Right
>and if you "cat" the file your terminal would probably ignore them
I think the terminal does not ignore them. My guess is that it interprets the characters, but this has no noticeable visual effect on the screen. E.g. if the letter "c" in "cat" is output as "c^hc" (to make it bold in print), the terminal will just print "c", a backspace, then "c" again, which to the user will look the same as a single "c".
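One way to check this guess is to look at the raw bytes rather than at the screen. A small sketch, assuming a POSIX shell with od and wc available; the function name overstruck is made up for the demo:

```shell
# overstruck (hypothetical demo): emit "cat" with an overstruck "c",
# i.e. "c", backspace, "c", then "at" and a newline.
overstruck() {
    printf 'c\bcat\n'
}

# Dump the bytes; od -c shows the backspace as \b.
overstruck | od -c

# Count the bytes: six are sent, although only "cat" shows on screen.
overstruck | wc -c
```

Running overstruck on its own in a terminal should just show "cat", which fits the idea that the backspace is acted on rather than ignored.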
You described the issue and cause better than I did :)
I didn't know about the ANSI behavior, interesting.
>FWIW, if you need a fool-proof way to convert a man page to plain old ASCII with no escapes at all, it's easiest just to redirect the output of "man" to a file:
> man ls > ls.txt
IIRC, even with that way, the control chars still appeared in the file (at least on some Unix version), which is why I wrote the script in the first place.
> the control chars still appeared in the file (at least on some Unix version)
That's a good point. My whole reply is a bit GNU-centric: the "col -bx" solution should work with everything except groff in ANSI mode (looks like the flag to go back to the standard troff behaviour is GROFF_NO_SGR [1][2], in case anybody is interested).
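For anybody who wants to try it, combining the two might look roughly like the sketch below. It assumes a groff-based man that passes the environment through to grotty, and the function name man_to_text is made up here; I haven't checked it on every distribution.

```shell
# man_to_text PAGE (hypothetical helper): save a plain-text copy of a
# man page as PAGE.txt in the current directory.
# Assumption: man uses groff, so setting GROFF_NO_SGR asks for the old
# backspace-overstrike output, which col -bx can then strip.
man_to_text() {
    GROFF_NO_SGR=1 man "$1" | col -bx > "$1.txt"
}

# Usage (not run here): man_to_text ls
```

The variable is set only for the man process, so it doesn't change how man pages look in the rest of the session.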
Ah, fair enough. The few times I have seen this issue were when jumping on old Solaris SPARC boxes, which also mangled the backspace key in the interactive terminal.
So it sounds like I've put 2 and 2 together and gotten 5.
Though coincidentally I do the redirect trick a lot, as well as querying the gzipped raw files too (one of the projects I'm working on requires building a cut-down man page parser).
I first created that script called m years ago on some Unix boxes that I was using. Could have been HP-UX or other version. And I've used it over the years on many Unix and Linux versions that I worked on. It could be that in some cases, the script is needed due to a tty or TERM or other configuration issue. But I'm pretty sure that I've had the need for it (to remove those control characters) on at least some systems where such config was okay. I know this because, while I do tweak env. var., stty and other settings now and then, I do not always (need to) do so, and have still found control chars [1] in the man output, even when redirecting to file. That is why I created the script, because when working on a C, Python or any other project where I need to read man pages, I often like to redirect the man output to a file (in my ~/man dir) and then read them using vi/vim.
[1] See what I say in that post (about m) about nroff and troff.
Also, I do know about stty, have used it for years, although there is less need for it these days. Used to do a lot of tweaking and experimenting with the erase option (for which, BTW, instead of the backspace key in your example, we can also write a literal ^h, i.e. caret, then letter h), intr option, onlcr, ocrnl and others. Used to be good fun and sometimes frustrating too, because docs for these areas were somewhat lacking then.
Yeah, I'm familiar with groff et al. I used to get this issue on old UNIX boxes but have not seen it on any of the Linux distros I use daily (nor on FreeBSD).
The other thing worth trying is checking whether your locales are set correctly and/or setting the GROFF_TYPESETTER environment variable. Though this might only be GNU-specific. Or just redirect the output (as suggested by the other reply), which is how I used to get around the issue on SunOS at least.
I don't make a habit of commenting on shell scripts what I haven't first read. But thank you all the same.
I used to use 'col' a lot to resolve the tabs vs spaces problem with different actors imposing different preferences in their source code. It's a handy tool but often overlooked.
>I don't make a habit of commenting on shell scripts what I haven't first read. But thank you all the same.
Sorry, I should have added "if you haven't already read it", in my earlier comment.
>I used to use 'col' a lot to resolve the tabs vs spaces problem with different actors imposing different preferences in their source code. It's a handy tool but often overlooked.
Interesting, I may not have noticed or known about the tabs option of col. Cool. Seems it is like the entab program in an edition of the K&R C book.
This shell snippet lists all the directories in $PATH, using less, so you can page through them:
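The snippet itself is not reproduced here. As a rough sketch of one common way to do it (not necessarily the original), split the value on colons with tr and pipe the result to less. The function name list_path_dirs and its optional argument are made up for illustration.

```shell
# list_path_dirs [VALUE] (hypothetical helper): print each directory of
# a colon-separated list on its own line; defaults to $PATH.
list_path_dirs() {
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n'
}

# Interactive use, paging through the result:
#   list_path_dirs | less
```

Keeping the pager out of the function means the same output can also go to grep, sort or wc -l.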
I came across the watch command recently, which seems useful, and wrote something like it in Python:
A Python version of the Linux watch command:
https://jugad2.blogspot.in/2018/05/a-python-version-of-linux...
m, a Unix shell utility to save cleaned-up man pages as text:
https://jugad2.blogspot.in/2017/03/m-unix-shell-utility-to-s...