# HG changeset patch
# User markus schnalke
# Date 1438715050 -7200
# Node ID 5cefcfc72d42a5ab3c9434434c9226e41911d600
# Parent 3b4e53e0495885ba1f9a017c3a5df52ca02c268e
Added first version of the translation to English

diff -r 3b4e53e04958 -r 5cefcfc72d42 cut.en.ms
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/cut.en.ms Tue Aug 04 21:04:10 2015 +0200
@@ -0,0 +1,493 @@
+.so macros
+.lc_ctype en_US.utf8
+.pl -4v
+
+.TL
+Cut out selected fields of each line of a file
+.AU
+markus schnalke
+..
+.FS
+2015-05.
+This text is in the public domain (CC0).
+It is available online:
+.I http://marmaro.de/docs/
+.FE
+
+.LP
+Cut is a classic program in the Unix toolchest.
+It is present in most tutorials on shell programming, because it
+is such a nice and useful tool with good explanatory value.
+This text shall take a look behind its surface.
+.SH
+Usage
+.LP
+Initially, cut had two operation modes, which were later
+supplemented by a third. Cut may cut specified characters out of the
+input lines or it may cut out specified fields, which are defined
+by a delimiting character.
+.PP
+The character mode is well suited for slicing fixed-width input
+formats into parts. One might, for instance, extract the access
+rights from the output of \f(CWls -l\fP, here the rights of the
+file's owner:
+.CS
+ $ ls -l foo
+ -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo
+.sp .3
+ $ ls -l foo | cut -c 2-4
+ rw-
+.CE
+.LP
+Or the write permission for the owner, the group and the
+world:
+.CS
+ $ ls -l foo | cut -c 3,6,9
+ ww-
+.CE
+.LP
+Cut can also be used to shorten strings:
+.CS
+ $ long=12345678901234567890
+.sp .3
+ $ echo "$long" | cut -c -10
+ 1234567890
+.CE
+.LP
+This command outputs no more than the first 10 characters of
+\f(CW$long\fP. (Alternatively, one could use \f(CWprintf
+"%.10s\\n" "$long"\fP for this job.)
+.PP
+However, if the concern is not displaying characters but storing
+them, then \f(CW-c\fP is only partly suited.
In former times,
+when US-ASCII was the omnipresent character encoding, each
+character was stored in exactly one byte. Therefore, \f(CWcut
+-c\fP selected output characters and bytes alike. With
+the rise of multi-byte encodings (like UTF-8), this assumption
+became obsolete. Consequently, a byte mode (option \f(CW-b\fP)
+was added to cut, with POSIX.2-1992. To select up to the first
+500 bytes of each line (and ignore the rest), one can use:
+.CS
+ $ cut -b -500
+.CE
+.LP
+The remainder can be caught with \f(CWcut -b 501-\fP. This
+possibility is important for POSIX, because it allows creating
+text files with limited line length
+.[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .
+.PP
+Although the byte mode was newly introduced, it was meant to
+behave exactly like the old character mode. The character mode,
+however, had to be implemented differently. As a consequence,
+the problem wasn't to support the byte mode, but to support the
+new character mode correctly.
+.PP
+Besides the character and byte modes, cut has the field mode,
+which is activated by \f(CW-f\fP. It selects fields from the
+input. The delimiting character (by default, the tab) may be
+changed using \f(CW-d\fP. It applies to the input as well as to
+the output.
+.PP
+The typical example for the use of cut's field mode is the
+selection of information from the passwd file. Here, for
+instance, the username and its uid:
+.CS
+ $ cut -d: -f1,3 /etc/passwd
+ root:0
+ bin:1
+ daemon:2
+ mail:8
+ ...
+.CE
+.LP
+(The values of the command line switches may be appended directly
+to them or separated by whitespace.)
+.PP
+The field mode is suited for simple tabular data, like the
+passwd file. Beyond that, it soon reaches its limits. In particular,
+the typical case of whitespace-separated fields is covered poorly
+by it. Cut's delimiter is exactly one character,
+therefore one cannot split at both space and tab characters.
+Furthermore, multiple adjacent delimiter characters lead to
+empty fields. This is not the expected behavior for
+the processing of whitespace-separated fields. Some
+implementations, e.g. the one of FreeBSD, have extensions that
+handle this case in the expected way. Otherwise, i.e.
+if one wants to stay portable, awk comes to the rescue.
+.PP
+Awk provides another feature that cut lacks: changing the order
+of the fields in the output. For cut, the order of the field
+selection specification is irrelevant; it doesn't even matter if
+fields are given multiple times. Thus, the invocation
+\f(CWcut -c 5-8,1,4-6\fP outputs the characters number
+1, 4, 5, 6, 7 and 8 in exactly this order. The
+selection works as in mathematical set theory: Each
+specified field is part of the solution set. The fields in the
+solution set are always in the same order as in the input. In
+the words of the man page in Version 8 Unix:
+``In data base parlance, it projects a relation.''
+.[[ http://man.cat-v.org/unix_8th/1/cut
+This means that cut applies the database operation \fIprojection\fP
+to the text input. Wikipedia explains it in the following way:
+``In practical terms, it can be roughly thought of as picking a
+sub-set of all available columns.''
+.[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
+
+.SH
+Historical Background
+.LP
+Cut came to public life in 1982 with the release of UNIX System
+III. Browsing through the sources of System III, one finds cut.c
+with the timestamp 1980-04-11
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .
+This is the oldest implementation of the program that I was able to
+discover. However, the SCCS-ID in the source code speaks of
+version 1.5. According to Doug McIlroy
+.[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,
+the earlier history likely lies in PWB/UNIX, which was the
+basis for System III.
In the available sources of PWB 1.0 (1977)
+.[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,
+no cut is present. Of PWB 2.0, no sources or useful documentation
+seem to be available. PWB 3.0 was later renamed to System III
+for marketing purposes, hence it is identical to it. A side branch
+of PWB was CB UNIX, which was only used internally at Bell
+Labs. The manual of CB UNIX Edition 2.1 of November 1979
+contains the earliest mention of cut that my research brought
+to light: a man page for it
+.[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .
+.PP
+Now a look at BSD: There, my earliest discovery is a cut.c with
+the file modification date of 1986-11-07
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut
+as part of the special version 4.3BSD-UWisc
+.[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,
+which was released in January 1987.
+This implementation is mostly identical to the one in System
+III. The better known 4.3BSD-Tahoe (1988) does not contain cut.
+The following 4.3BSD-Reno (1990) does include cut. It is a freshly
+written one by Adam S. Moskowitz and Marciano Pitargue, which was
+included in BSD in 1989
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .
+Its man page
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1
+already mentions the expected compliance with POSIX.2.
+One should note that POSIX.2 was first published in
+September 1992, about two years after the man page and the
+program were written. Hence, the program must have been
+implemented based on a draft version of the standard. A look into
+the code confirms the assumption. The function to parse the field
+selection includes the following comment:
+.QP
+This parser is less restrictive than the Draft 9 POSIX spec.
+POSIX doesn't allow lists that aren't in increasing order or
+overlapping lists.
+.LP
+Draft 11.2 of POSIX (1991-09) already requires this flexibility:
+.QP
+The elements in list can be repeated, can overlap, and can
+be specified in any order.
+.LP
+The same draft additionally includes all three operation modes,
+whereas this early BSD cut only implemented the original two.
+Draft 9 might not have included the byte mode. Without access to
+Draft 9 or 10, it wasn't possible to verify this guess.
+.PP
+The version numbers and change dates of the older BSD
+implementations are recorded in the SCCS-IDs, which the
+version control system of that time inserted. For instance
+in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.
+.PP
+The cut implementation of the GNU coreutils contains the
+following copyright notice:
+.CS
+ Copyright (C) 1997-2015 Free Software Foundation, Inc.
+ Copyright (C) 1984 David M. Ihnat
+.CE
+.LP
+The code does have pretty old origins. Further comments show that
+the source code was reworked by David MacKenzie first and later
+by Jim Meyering, who put it into the version control system in
+1992. It is unclear why the years before 1997, at least from
+1992 on, don't show up in the copyright notice.
+.PP
+Despite all those year numbers from the 80s, cut is a rather
+young tool, at least in relation to the early Unix. Although
+it is a decade older than the Linux kernel, Unix had already
+existed for over ten years when cut appeared for the first
+time. Most notably, cut wasn't part of Version 7 Unix, which
+became the basis for all modern Unix systems. The more complex
+tools sed and awk had been part of it already. Hence, the
+question comes to mind why cut was written at all, as there
+existed two programs that were able to cover the use cases of
+cut. One reason for cut surely was its compactness and the
+resulting speed, in comparison to the then bulky awk. This lean
+shape goes well with the Unix philosophy: Do one job and do it
+well! Cut was convincing.
It found its way into other Unix variants,
+it became standardized and today it is present everywhere.
+.PP
+The original variant (without \f(CW-b\fP) was described by the
+System V Interface Definition, an important formal description
+of UNIX System V, already in 1985. In the following years, it
+appeared in all relevant standards. POSIX.2 in 1992 specified
+cut for the first time in its modern form (with \f(CW-b\fP).
+
+.SH
+Multi-byte support
+.LP
+The byte mode and thus the multi-byte support of
+the POSIX character mode have been standardized since 1992. But
+how about their presence in the available implementations?
+Which versions implement POSIX correctly?
+.PP
+The situation is divided into three parts: There are historic
+implementations that have only \f(CW-c\fP and \f(CW-f\fP.
+Then there are implementations that have \f(CW-b\fP but
+treat it as an alias for \f(CW-c\fP only. These
+implementations work correctly for single-byte encodings
+(e.g. US-ASCII, Latin1) but for multi-byte encodings (e.g.
+UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and
+\f(CW-n\fP is ignored). Finally, there are implementations
+that implement \f(CW-b\fP and \f(CW-c\fP in a POSIX-compliant way.
+.PP
+Historic two-mode implementations are the ones of
+System III, System V and the BSD ones until the mid-90s.
+.PP
+Pseudo multi-byte implementations are provided by GNU and
+modern NetBSD and OpenBSD. The level of POSIX compliance
+that is claimed there is often higher than the level of
+compliance that is actually provided. Sometimes it takes a
+close look to discover that \f(CW-c\fP and \f(CW-n\fP don't
+behave as expected. Some of the implementations take the
+easy way by simply ignoring multi-byte
+encodings; at least they state that clearly:
+.QP
+Since we don't support multi-byte characters, the \f(CW-c\fP and \f(CW-b\fP
+options are equivalent, and the \f(CW-n\fP option is meaningless.
+.[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
+.LP
+Standard-adhering implementations, ones that treat
+multi-byte characters correctly, are the one in modern
+FreeBSD and the one in the Heirloom toolchest. Tim Robbins
+reimplemented the character mode of FreeBSD cut,
+conforming to POSIX, in summer 2004
+.[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
+The question of why the other BSD systems have not
+integrated this change is an open one. Maybe the answer can be
+found in the above quoted statement.
+.PP
+How does a user find out whether the cut on their own system handles
+multi-byte characters correctly? First, one needs to check if
+the system itself uses multi-byte characters, because otherwise
+characters and bytes are equivalent and the question
+is irrelevant. One can check this by looking at the locale
+settings, but it is easier to print a typical multi-byte
+character, for instance an umlaut or the Euro currency
+symbol, and check if one or more bytes are output:
+.CS
+ $ echo ä | od -c
+ 0000000 303 244 \\n
+ 0000003
+.CE
+.LP
+In this case, there are two bytes: octal 303 and 244. (The
+newline character is added by echo.)
+.PP
+The program iconv converts text to specific encodings. This
+is the output for Latin1 and UTF-8, for comparison:
+.CS
+ $ echo ä | iconv -t latin1 | od -c
+ 0000000 344 \\n
+ 0000002
+.sp .3
+ $ echo ä | iconv -t utf8 | od -c
+ 0000000 303 244 \\n
+ 0000003
+.CE
+.LP
+The output (without the iconv conversion) on many European
+systems equals one of these two.
+.PP
+Now the test of the cut implementation.
On a UTF-8 system, a
+POSIX-compliant implementation behaves as follows:
+.CS
+ $ echo ä | cut -c 1 | od -c
+ 0000000 303 244 \\n
+ 0000003
+.sp .3
+ $ echo ä | cut -b 1 | od -c
+ 0000000 303 \\n
+ 0000002
+.sp .3
+ $ echo ä | cut -b 1 -n | od -c
+ 0000000 \\n
+ 0000001
+.CE
+.LP
+A pseudo-POSIX implementation, in contrast, behaves like the
+middle one, for all three invocations: Only the first byte is
+output.
+
+.SH
+Implementations
+.LP
+Let's take a look at the sources of a selection of
+implementations.
+.PP
+A comparison of the amount of source code is a good way to get a first
+impression. Typically, it grows over time. This can be seen
+here, in general but not in all cases. A POSIX-compliant
+implementation of the character mode requires more code, thus
+these implementations tend to be the larger ones.
+.TS
+center;
+r r r l l l.
+SLOC	Lines	Bytes	Belongs to	File time	Category
+_
+116	123	2966	System III	1980-04-11	historic
+118	125	3038	4.3BSD-UWisc	1986-11-07	historic
+200	256	5715	4.3BSD-Reno	1990-06-25	historic
+200	270	6545	NetBSD	1993-03-21	historic
+218	290	6892	OpenBSD	2008-06-27	pseudo-POSIX
+224	296	6920	FreeBSD	1994-05-27	historic
+232	306	7500	NetBSD	2014-02-03	pseudo-POSIX
+340	405	7423	Heirloom	2012-05-20	POSIX
+382	586	14175	GNU coreutils	1992-11-08	pseudo-POSIX
+391	479	10961	FreeBSD	2012-11-24	POSIX
+588	830	23167	GNU coreutils	2015-05-01	pseudo-POSIX
+.TE
+.LP
+Roughly four groups can be seen: (1) The two original
+implementations, which are mostly identical, with about 100
+SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
+two POSIX-compliant versions and the old GNU one, with a SLOC
+count in the 300s. And finally (4) the modern GNU cut with
+almost 600 SLOC.
+.PP
+The ratio of the number of
+newlines in the file (\f(CWwc -l\fP) to the number of logical code
+lines (SLOC, measured with SLOCcount) ranges from a factor of
+1.06 for the oldest versions to a factor of 1.5 for GNU.
The
+largest influences on it are empty lines, pure comment lines
+and the size of the license block at the beginning of the file.
+.PP
+Regarding the relation between logical code lines and the
+file size (\f(CWwc -c\fP), the implementations mostly range between
+25 and 30 bytes per statement. With only 21 bytes per
+statement, the Heirloom implementation marks the lower end;
+the GNU implementation sets the upper limit at nearly 40. In
+the case of GNU, the reason is mainly their coding style, with
+special indent rules and long identifiers. Whether one finds
+the Heirloom implementation
+.[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup
+highly cryptic or exceptionally elegant shall be left
+open to the judgement of the reader. Especially the
+comparison to the GNU implementation
+.[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643
+is impressive.
+.PP
+The internal structure of the source code (in all cases it is
+written in C) is largely similar. Besides the mandatory main
+function, which does the command line argument processing,
+there usually exists a function to convert the field
+selection specification to an internal data structure.
+Furthermore, almost all implementations have separate
+functions for each of their operation modes. The POSIX-compliant
+versions treat the \f(CW-b -n\fP combination as a separate
+mode and thus implement it in a function of its own. Only the early
+System III implementation (and its 4.3BSD-UWisc variant) does
+everything, apart from error handling, in the main function.
+.PP
+Implementations of cut typically have two limiting aspects:
+One being the maximum number of fields that can be handled,
+the other being the maximum line length. On System III, both
+numbers are limited to 512. 4.3BSD-Reno and the BSDs of the
+90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or
+\f(CW_POSIX2_LINE_MAX\fP).
Modern FreeBSD, NetBSD, all GNU
+implementations and the Heirloom cut are able to handle
+arbitrary numbers of fields and line lengths \(en the memory
+is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed
+maximum number of fields, but allows arbitrary line lengths.
+The limited number of fields does not, however, appear to be
+a practical problem, because \f(CW_POSIX2_LINE_MAX\fP is
+guaranteed to be at least 2048 and is thus probably large enough.
+
+.SH
+Descriptions
+.LP
+Interesting, as well, is a comparison of the short descriptions
+of cut, as can be found in the headlines of the man
+pages or at the beginning of the source code files.
+The following list is roughly sorted by time and grouped by
+descent:
+.TS
+center;
+l l.
+CB UNIX	cut out selected fields of each line of a file
+System III	cut out selected fields of each line of a file
+System III \(dg	cut and paste columns of a table (projection of a relation)
+System V	cut out selected fields of each line of a file
+HP-UX	cut out (extract) selected fields of each line of a file
+.sp .3
+4.3BSD-UWisc \(dg	cut and paste columns of a table (projection of a relation)
+4.3BSD-Reno	select portions of each line of a file
+NetBSD	select portions of each line of a file
+OpenBSD 4.6	select portions of each line of a file
+FreeBSD 1.0	select portions of each line of a file
+FreeBSD 10.0	cut out selected portions of each line of a file
+SunOS 4.1.3	remove selected fields from each line of a file
+SunOS 5.5.1	cut out selected fields of each line of a file
+.sp .3
+Heirloom Tools	cut out selected fields of each line of a file
+Heirloom Tools \(dg	cut out fields of lines of files
+.sp .3
+GNU coreutils	remove sections from each line of files
+.sp .3
+Minix	select out columns of a file
+.sp .3
+Version 8 Unix	rearrange columns of data
+``Unix Reader''	rearrange columns of text
+.sp .3
+POSIX	cut out selected fields of each line of a file
+.TE
+.LP
+(The descriptions that are marked with `\(dg' were taken
from
+source code files. The POSIX entry contains the description
+used in the standard. The ``Unix Reader'' is a retrospective
+document by Doug McIlroy, which lists the availability of
+tools in the Research Unix versions
+.[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .
+Its description should actually match the one in Version 8
+Unix. The change could be a transcription mistake or a correction.
+All other descriptions originate from the various man pages.)
+.PP
+Over time, the POSIX description was often adopted or it
+served as inspiration. One such example is FreeBSD
+.[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .
+.PP
+It is noteworthy that the GNU coreutils in all versions
+describe the performed action as a removal of parts of the
+input, although the user clearly selects the parts that are
+output. Probably the words ``cut out'' are too misleading.
+HP-UX made them more concrete.
+.PP
+There are also different terms used for the thing being
+selected. Some talk about fields (POSIX), some talk
+about portions (BSD) and some call them columns (Research
+Unix).
+.PP
+The seemingly least adequate description, the one of Version
+8 Unix (``rearrange columns of data''), can be explained
+insofar as the man page covers both cut and paste, and in
+their combination, columns can be rearranged. The use of
+``data'' instead of ``text'' might be a lapse, which McIlroy
+corrected in his Unix Reader ... but, on the other hand, on
+Unix, the two words are mostly synonymous, because all data
+is text.
+
+
+.SH
+References
+.LP
+.nf
+._r
+