--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/cut.en.ms	Tue Aug 04 21:04:10 2015 +0200
@@ -0,0 +1,493 @@
+.so macros
+.lc_ctype en_US.utf8
+.pl -4v
+
+.TL
+Cut out selected fields of each line of a file
+.AU
+markus schnalke <meillo@marmaro.de>
+..
+.FS
+2015-05.
+This text is in the public domain (CC0).
+It is available online:
+.I http://marmaro.de/docs/
+.FE
+
+.LP
+Cut is a classic program in the Unix toolchest.
+It is present in most tutorials on shell programming, because it
+is such a nice and useful tool with good explanatory value.
+This text takes a look beneath its surface.
+.SH
+Usage
+.LP
+Initially, cut had two operation modes; a third one was added
+later. Cut may cut specified characters out of the
+input lines or it may cut out specified fields, which are defined
+by a delimiting character.
+.PP
+The character mode is well suited to slice fixed-width input
+formats into parts. One might, for instance, extract the access
+rights from the output of \f(CWls -l\fP, here the rights of the
+file's owner:
+.CS
+	$ ls -l foo
+	-rw-rw-r-- 1 meillo users 0 May 12 07:32 foo
+.sp .3
+	$ ls -l foo | cut -c 2-4
+	rw-
+.CE
+.LP
+Or the write permission for the owner, the group and the
+world:
+.CS
+	$ ls -l foo | cut -c 3,6,9
+	ww-
+.CE
+.LP
+Cut can also be used to shorten strings:
+.CS
+	$ long=12345678901234567890
+.sp .3
+	$ echo "$long" | cut -c -10
+	1234567890
+.CE
+.LP
+This command outputs no more than the first 10 characters of
+\f(CW$long\fP. (Alternatively, one could use \f(CWprintf
+"%.10s\\n" "$long"\fP for this job.)
+.PP
+However, when the concern is not displaying characters but
+storing them, \f(CW-c\fP is only partly suited. In former times,
+when US-ASCII was the omnipresent character encoding, each
+character was stored in exactly one byte. Therefore, \f(CWcut
+-c\fP selected output characters and bytes alike. With
+the rise of multi-byte encodings (like UTF-8), this assumption
+became obsolete. Consequently, a byte mode (option \f(CW-b\fP)
+was added to cut with POSIX.2-1992. To select up to the first
+500 bytes of each line (and ignore the rest), one can use:
+.CS
+	$ cut -b -500
+.CE
+.LP
+The remainder can be caught with \f(CWcut -b 501-\fP. This
+possibility is important for POSIX, because it allows creating
+text files with limited line length
+.[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .
+.PP
+Although the byte mode was newly introduced, it was meant to
+behave exactly like the old character mode. The character mode,
+however, had to be reimplemented to handle multi-byte characters.
+Consequently, the difficulty was not in supporting the byte mode,
+but in supporting the new character mode correctly.
+.PP
+Besides the character and byte modes, cut has the field mode,
+which is activated by \f(CW-f\fP. It selects fields from the
+input. The delimiting character (by default, the tab) may be
+changed using \f(CW-d\fP. It applies to the input as well as to
+the output.
+.PP
+The typical example for the use of cut's field mode is the
+selection of information from the passwd file. Here, for
+instance, the user names and their uids:
+.CS
+	$ cut -d: -f1,3 /etc/passwd
+	root:0
+	bin:1
+	daemon:2
+	mail:8
+	...
+.CE
+.LP
+(The arguments of the command line options may be appended directly
+to them or separated by whitespace.)
+.PP
+The field mode is suited for simple tabular data, like the
+passwd file. Beyond that, it soon reaches its limits. In particular,
+the typical case of whitespace-separated fields is covered poorly
+by it. Cut's delimiter is exactly one character;
+therefore one cannot split at both space and tab characters.
+Furthermore, multiple adjacent delimiter characters lead to
+empty fields. This is not the expected behavior for
+the processing of whitespace-separated fields. Some
+implementations, e.g. the one of FreeBSD, have extensions that
+handle this case in the expected way. Otherwise, i.e.
+if one wants to stay portable, awk comes to the rescue.
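+.PP
+As a minimal sketch of the difference (the sample input is made
+up for illustration), consider a line that uses two spaces and a
+tab as separators. With a space as delimiter, cut reports an
+empty second field, whereas awk's default field splitting treats
+any run of blanks as a single separator:
+.CS
+	$ printf 'a  b\\tc\\n' | cut -d' ' -f2
+
+.sp .3
+	$ printf 'a  b\\tc\\n' | awk '{print $2}'
+	b
+.CE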
+.PP
+Awk provides another feature that cut lacks: changing the order
+of the fields in the output. For cut, the order of the field
+selection specification is irrelevant; it doesn't even matter if
+fields are given multiple times. Thus, the invocation
+\f(CWcut -c 5-8,1,4-6\fP outputs the characters number
+1, 4, 5, 6, 7 and 8 in exactly this order. The
+selection works as in mathematical set theory: Each
+specified field is part of the solution set. The fields in the
+solution set are always in the same order as in the input. In
+the words of the man page in Version 8 Unix:
+``In data base parlance, it projects a relation.''
+.[[ http://man.cat-v.org/unix_8th/1/cut
+That is, cut applies the database operation \fIprojection\fP
+to the text input. Wikipedia explains it in the following way:
+``In practical terms, it can be roughly thought of as picking a
+sub-set of all available columns.''
+.[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
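+.LP
+A short illustration of this set-like behavior, with made-up
+input: cut emits the selected characters or fields in input
+order, no matter how the list is written, whereas awk prints
+them in the order requested.
+.CS
+	$ echo 123456789 | cut -c 5-8,1,4-6
+	145678
+.sp .3
+	$ echo a:b:c | cut -d: -f3,1
+	a:c
+.sp .3
+	$ echo a:b:c | awk -F: '{print $3 ":" $1}'
+	c:a
+.CE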
+
+.SH
+Historical Background
+.LP
+Cut came to public life in 1982 with the release of UNIX System
+III. Browsing through the sources of System III, one finds cut.c
+with the timestamp 1980-04-11
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .
+This is the oldest implementation of the program that I was able to
+discover. However, the SCCS-ID in the source code speaks of
+version 1.5. According to Doug McIlroy
+.[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,
+the earlier history likely lies in PWB/UNIX, which was the
+basis for System III. In the available sources of PWB 1.0 (1977)
+.[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,
+no cut is present. Of PWB 2.0, no sources or useful documentation
+seem to be available. PWB 3.0 was later renamed to System III
+for marketing purposes, hence it is identical to it. A side line
+of PWB was CB UNIX, which was used only internally at
+Bell Labs. The manual of CB UNIX Edition 2.1 of November 1979
+contains the earliest mention of cut that my research brought
+to light: a man page for it
+.[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .
+.PP
+Now a look at BSD: There, my earliest discovery is a cut.c with
+the file modification date of 1986-11-07
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut
+as part of the special version 4.3BSD-UWisc
+.[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,
+which was released in January 1987.
+This implementation is mostly identical to the one in System
+III. The better known 4.3BSD-Tahoe (1988) does not contain cut.
+The following 4.3BSD-Reno (1990) does include cut. It is a fresh
+implementation by Adam S. Moskowitz and Marciano Pitargue, which was
+included in BSD in 1989
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .
+Its man page
+.[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1
+already mentions the expected compliance with POSIX.2.
+One should note that POSIX.2 was first published in
+September 1992, about two years after the man page and the
+program were written. Hence, the program must have been
+implemented based on a draft version of the standard. A look into
+the code confirms this assumption. The function to parse the field
+selection includes the following comment:
+.QP
+This parser is less restrictive than the Draft 9 POSIX spec.
+POSIX doesn't allow lists that aren't in increasing order or
+overlapping lists.
+.LP
+Draft 11.2 of POSIX (1991-09) requires this flexibility already:
+.QP
+The elements in list can be repeated, can overlap, and can
+be specified in any order.
+.LP
+The same draft additionally includes all three operation modes,
+whereas this early BSD cut only implemented the original two.
+Draft 9 might not have included the byte mode. Without access to
+Draft 9 or 10, it wasn't possible to verify this guess.
+.PP
+The version numbers and change dates of the older BSD
+implementations are recorded in the SCCS-IDs, which the
+version control system of that time inserted. For instance
+in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.
+.PP
+The cut implementation of the GNU coreutils contains the
+following copyright notice:
+.CS
+	Copyright (C) 1997-2015 Free Software Foundation, Inc.
+	Copyright (C) 1984 David M. Ihnat
+.CE
+.LP
+The code does have pretty old origins. Further comments show that
+the source code was reworked by David MacKenzie first and later
+by Jim Meyering, who put it into the version control system in
+1992. It is unclear why the years before 1997, at least from
+1992 on, don't show up in the copyright notice.
+.PP
+Despite all those year numbers from the 80s, cut is a rather
+young tool, at least in relation to early Unix. Although cut is
+a decade older than the Linux kernel, Unix had already
+existed for over ten years when cut appeared for the first
+time. Most notably, cut wasn't part of Version 7 Unix, which
+became the basis for all modern Unix systems. The more complex
+tools sed and awk had been part of it already. Hence, the
+question comes to mind why cut was written at all, as there
+already existed two programs that were able to cover the use cases
+of cut. One reason for cut surely was its compactness and the
+resulting speed, in comparison to the then bulky awk. This lean
+shape goes well with the Unix philosophy: Do one job and do it
+well! Cut convinced. It found its way to other Unix variants,
+it became standardized and today it is present everywhere.
+.PP
+The original variant (without \f(CW-b\fP) was described by the
+System V Interface Definition, an important formal description
+of UNIX System V, already in 1985. In the following years, it
+appeared in all relevant standards. POSIX.2 in 1992 specified
+cut for the first time in its modern form (with \f(CW-b\fP).
+
+.SH
+Multi-byte support
+.LP
+The byte mode and thus the multi-byte support of
+the POSIX character mode have been standardized since 1992. But
+how about their presence in the available implementations?
+Which versions implement POSIX correctly?
+.PP
+The implementations fall into three groups: There are historic
+implementations, which have only \f(CW-c\fP and \f(CW-f\fP.
+Then there are implementations that have \f(CW-b\fP but
+treat it merely as an alias for \f(CW-c\fP. These
+implementations work correctly for single-byte encodings
+(e.g. US-ASCII, Latin1) but for multi-byte encodings (e.g.
+UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and
+\f(CW-n\fP is ignored). Finally, there are implementations
+that implement \f(CW-b\fP and \f(CW-c\fP in a POSIX-compliant way.
+.PP
+Historic two-mode implementations are those of
+System III and System V, and the BSD ones until the mid-90s.
+.PP
+Pseudo multi-byte implementations are provided by GNU and
+modern NetBSD and OpenBSD. The level of POSIX compliance
+that is claimed there is often higher than the level of
+compliance that is actually provided. Sometimes it takes a
+close look to discover that \f(CW-c\fP and \f(CW-n\fP don't
+behave as expected. Some of the implementations take the
+easy way and simply ignore multi-byte
+encodings; at least they state that clearly:
+.QP
+Since we don't support multi-byte characters, the \f(CW-c\fP and \f(CW-b\fP
+options are equivalent, and the \f(CW-n\fP option is meaningless.
+.[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
+.LP
+Standard-adhering implementations, ones that treat
+multi-byte characters correctly, are those of modern
+FreeBSD and of the Heirloom toolchest. Tim Robbins
+reimplemented the character mode of FreeBSD cut,
+conforming to POSIX, in summer 2004
+.[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
+The question why the other BSD systems have not
+integrated this change is an open one. Maybe the answer can be
+found in the above quoted statement.
+.PP
+How can users find out whether the cut on their own system handles
+multi-byte characters correctly? First, one needs to check if
+the system itself uses multi-byte characters, because otherwise
+characters and bytes are equivalent and the question
+is irrelevant. One can check this by looking at the locale
+settings, but it is easier to print a typical multi-byte
+character, for instance an umlaut or the euro currency
+symbol, and check if one or more bytes are output:
+.CS
+	$ echo ä | od -c
+	0000000 303 244  \\n
+	0000003
+.CE
+.LP
+In this case, there are two bytes: octal 303 and 244. (The
+newline character is added by echo.)
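+.PP
+For the locale-based check, a minimal sketch is to ask the POSIX
+\f(CWlocale\fP utility for the current character map. The
+output shown here is only an example of what a system configured
+for UTF-8 might print; other systems report their own encoding
+names:
+.CS
+	$ locale charmap
+	UTF-8
+.CE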
+.PP
+The program iconv converts text to specific encodings. This
+is the output for Latin1 and UTF-8, for comparison:
+.CS
+	$ echo ä | iconv -t latin1 | od -c
+	0000000 344  \\n
+	0000002
+.sp .3
+	$ echo ä | iconv -t utf8 | od -c
+	0000000 303 244  \\n
+	0000003
+.CE
+.LP
+The output (without the iconv conversion) on many European
+systems equals one of these two.
+.PP
+Now the test of the cut implementation. On a UTF-8 system, a
+POSIX-compliant implementation behaves as follows:
+.CS
+	$ echo ä | cut -c 1 | od -c
+	0000000 303 244  \\n
+	0000003
+.sp .3
+	$ echo ä | cut -b 1 | od -c
+	0000000 303  \\n
+	0000002
+.sp .3
+	$ echo ä | cut -b 1 -n | od -c
+	0000000  \\n
+	0000001
+.CE
+.LP
+A pseudo POSIX implementation, in contrast, behaves like the
+middle one, for all three invocations: Only the first byte is
+output.
+
+.SH
+Implementations
+.LP
+Let's take a look at the sources of a selection of
+implementations.
+.PP
+A comparison of the amount of source code gives a first
+impression.  Typically, it grows over time. This can be seen
+here, in general but not in all cases. A POSIX-compliant
+implementation of the character mode requires more code, thus
+these implementations tend to be the larger ones.
+.TS
+center;
+r r r l l l.
+SLOC	Lines	Bytes	Belongs to  	File time	Category
+_
+116	123	 2966	System III	1980-04-11	historic
+118	125	 3038	4.3BSD-UWisc	1986-11-07	historic
+200	256	 5715	4.3BSD-Reno	1990-06-25	historic
+200	270	 6545	NetBSD	1993-03-21	historic
+218	290	 6892	OpenBSD	2008-06-27	pseudo-POSIX
+224	296	 6920	FreeBSD	1994-05-27	historic
+232	306	 7500	NetBSD 	2014-02-03	pseudo-POSIX
+340	405	 7423	Heirloom	2012-05-20	POSIX
+382	586	14175	GNU coreutils	1992-11-08	pseudo-POSIX
+391	479	10961	FreeBSD	2012-11-24	POSIX
+588	830	23167	GNU coreutils	2015-05-01	pseudo-POSIX
+.TE
+.LP
+Roughly four groups can be seen: (1) The two original
+implementations, which are mostly identical, with about 100
+SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
+two POSIX-compliant versions and the old GNU one, with a SLOC
+count in the 300s. And finally (4) the modern GNU cut with
+almost 600 SLOC.
+.PP
+The ratio between the number of logical code
+lines (SLOC, measured with SLOCcount) and the number of
+newlines in the file (\f(CWwc -l\fP) ranges from factor
+1.06 for the oldest versions to factor 1.5 for GNU. The
+largest influences on it are empty lines, pure comment lines
+and the size of the license block at the beginning of the file.
+.PP
+Regarding the ratio between logical code lines and the
+file size (\f(CWwc -c\fP), the implementations range roughly
+from 25 to 30 bytes per statement. With only 21 bytes per
+statement, the Heirloom implementation marks the lower end;
+the GNU implementation sets the upper limit at nearly 40. In
+the case of GNU, the reason is mainly their coding style, with
+special indent rules and long identifiers. Whether one finds
+the Heirloom implementation
+.[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup
+highly cryptic or exceptionally elegant shall be left
+to the judgement of the reader. Especially the
+comparison to the GNU implementation
+.[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643
+is impressive.
+.PP
+The internal structure of the source code (in all cases it is
+written in C) is largely similar. Besides the mandatory main
+function, which does the command line argument processing,
+there usually exists a function to convert the field
+selection specification to an internal data structure.
+Furthermore, almost all implementations have separate
+functions for each of their operation modes. The POSIX-compliant
+versions treat the \f(CW-b -n\fP combination as a separate
+mode and thus implement it in a function of its own. Only the early
+System III implementation (and its 4.3BSD-UWisc variant) does
+everything, apart from error handling, in the main function.
+.PP
+Implementations of cut typically have two limiting aspects:
+One being the maximum number of fields that can be handled,
+the other being the maximum line length. On System III, both
+numbers are limited to 512. 4.3BSD-Reno and the BSDs of the
+90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or
+\f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, NetBSD, all GNU
+implementations and the Heirloom cut are able to handle
+arbitrary numbers of fields and line lengths \(en the memory
+is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed
+maximum number of fields, but allows arbitrary line lengths.
+The limited number of fields does not, however, appear to be
+a practical problem, because \f(CW_POSIX2_LINE_MAX\fP is
+guaranteed to be at least 2048 and is thus probably large enough.
+
+.SH
+Descriptions
+.LP
+Interesting, as well, is a comparison of the short descriptions
+of cut, as can be found in the headlines of the man
+pages or at the beginning of the source code files.
+The following list is roughly sorted by time and grouped by
+descent:
+.TS
+center;
+l l.
+CB UNIX	cut out selected fields of each line of a file
+System III	cut out selected fields of each line of a file
+System III \(dg	cut and paste columns of a table (projection of a relation)
+System V	cut out selected fields of each line of a file
+HP-UX	cut out (extract) selected fields of each line of a file
+.sp .3
+4.3BSD-UWisc \(dg	cut and paste columns of a table (projection of a relation)
+4.3BSD-Reno	select portions of each line of a file
+NetBSD	select portions of each line of a file
+OpenBSD 4.6	select portions of each line of a file
+FreeBSD 1.0	select portions of each line of a file
+FreeBSD 10.0	cut out selected portions of each line of a file
+SunOS 4.1.3	remove selected fields from each line of a file
+SunOS 5.5.1	cut out selected fields of each line of a file
+.sp .3
+Heirloom Tools	cut out selected fields of each line of a file
+Heirloom Tools \(dg	cut out fields of lines of files
+.sp .3
+GNU coreutils	remove sections from each line of files
+.sp .3
+Minix	select out columns of a file
+.sp .3
+Version 8 Unix	rearrange columns of data
+``Unix Reader''	rearrange columns of text
+.sp .3
+POSIX	cut out selected fields of each line of a file
+.TE
+.LP
+(The descriptions that are marked with `\(dg' were taken from
+source code files. The POSIX entry contains the description
+used in the standard. The ``Unix Reader'' is a retrospective
+document by Doug McIlroy, which lists the availability of
+tools in the Research Unix versions
+.[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .
+Its description should actually match the one in Version 8
+Unix. The change could be a transcription mistake or a correction.
+All other descriptions originate from the various man pages.)
+.PP
+Over time, the POSIX description was often adopted or it
+served as inspiration. One such example is FreeBSD
+.[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .
+.PP
+It is noteworthy that the GNU coreutils in all versions
+describe the performed action as a removal of parts of the
+input, although the user clearly selects the parts that are
+output. Probably the words ``cut out'' are too ambiguous.
+HP-UX made them more explicit.
+.PP
+Different terms are also used for the thing being
+selected. Some talk about fields (POSIX), some talk
+about portions (BSD) and some about columns (Research
+Unix).
+.PP
+The seemingly least adequate description, the one of Version
+8 Unix (``rearrange columns of data''), can be explained by
+the fact that the man page covers both cut and paste, and in
+their combination, columns can be rearranged. The use of
+``data'' instead of ``text'' might be a lapse, which McIlroy
+corrected in his Unix Reader ... but, on the other hand, on
+Unix, the two words are mostly synonymous, because all data
+is text.
+
+
+.SH
+References
+.LP
+.nf
+._r
+