docs/cut

view cut.en.ms @ 31:106609b64dc4

minor corrections and improvements in the text
author markus schnalke <meillo@marmaro.de>
date Tue, 15 Sep 2015 17:20:20 +0200
1 .so macros
2 .lc_ctype en_US.utf8
3 .pl -4v
5 .TL
6 Cut out selected fields of each line of a file
7 .AU
8 markus schnalke <meillo@marmaro.de>
9 ..
10 .FS
11 2015-05.
12 This text is in the public domain (CC0).
13 It is available online:
14 .I http://marmaro.de/docs/
15 .FE
17 .LP
18 Cut is a classic program in the Unix toolchest.
19 It is present in most tutorials on shell programming, because it
20 is such a nice and useful tool with good explanatory value.
21 This text shall take a look underneath its surface.
22 .SH
23 Usage
24 .LP
25 Initially, cut had two operation modes, which were later supplemented
26 by a third: The cut program may cut specified characters or bytes
27 out of the input lines or it may cut out specified fields, which
28 are defined by a delimiting character.
29 .PP
30 The character mode is well suited to slice fixed-width input
31 formats into parts. One might, for instance, extract the access
32 rights from the output of \f(CWls -l\fP, as shown here with the
33 rights of a file's owner:
34 .CS
35 $ ls -l foo
36 -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo
37 .sp .3
38 $ ls -l foo | cut -c 2-4
39 rw-
40 .CE
41 .LP
42 Or the write permission for the owner, the group, and the
43 world:
44 .CS
45 $ ls -l foo | cut -c 3,6,9
46 ww-
47 .CE
48 .LP
49 Cut can also be used to shorten strings:
50 .CS
51 $ long=12345678901234567890
52 .sp .3
53 $ echo "$long" | cut -c -10
54 1234567890
55 .CE
56 .LP
57 This command outputs no more than the first 10 characters of
58 \f(CW$long\fP. (Alternatively, one could use \f(CWprintf
59 "%.10s\\n" "$long"\fP for this task.)
60 .PP
61 However, if it's not about displaying characters, but rather about
62 storing them, then \f(CW-c\fP is only partly suited. In former times,
63 when US-ASCII was the omnipresent character encoding, each
64 character was stored as exactly one byte. Therefore, \f(CWcut
65 -c\fP selected both output characters and bytes equally. With
66 the rise of multi-byte encodings (like UTF-8), this assumption
67 became obsolete. Consequently, a byte mode (option \f(CW-b\fP)
68 was added to cut, with POSIX.2-1992. To select up to 500 bytes
69 from the beginning of each line (and ignore the rest), one can use:
70 .CS
71 $ cut -b -500
72 .CE
73 .LP
74 The remainder can be caught with \f(CWcut -b 501-\fP. This
75 use of cut is important for POSIX, because it provides a
76 transformation of text files with arbitrary line lengths to text
77 files with limited line length
78 .[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .
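.PP
As a small illustration of this splitting, the following sketch uses
a limit of 10 bytes instead of 500 and a made-up sample line; both
are assumptions chosen only for brevity. Joining the two parts
again yields the original line.

```shell
# Sample input: one line of 26 bytes (hypothetical data).
line=abcdefghijklmnopqrstuvwxyz

# Keep at most the first 10 bytes of the line ...
head_part=$(printf '%s\n' "$line" | cut -b -10)
# ... and catch the remainder, starting at byte 11.
tail_part=$(printf '%s\n' "$line" | cut -b 11-)

printf '%s\n' "$head_part"    # abcdefghij
printf '%s\n' "$tail_part"    # klmnopqrstuvwxyz

# Concatenating both parts restores the original line.
printf '%s%s\n' "$head_part" "$tail_part"
```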
79 .PP
80 The newly introduced byte mode essentially provided the same
81 functionality as the old character mode. The character mode,
82 however, now required a new, different implementation. Consequently,
83 the difficulty lay not in supporting the byte mode, but rather in
84 correctly supporting the new character mode.
85 .PP
86 Besides the character and byte modes, cut also offers a field
87 mode, which is activated by \f(CW-f\fP. It selects fields from
88 the input. The field-delimiter character for the input as well
89 as for the output (by default the tab) may be changed using
90 \f(CW-d\fP.
91 .PP
92 The typical example for the use of cut's field mode is the
93 selection of information from the password file. Here, for
94 instance, the usernames and their uids:
95 .CS
96 $ cut -d: -f1,3 /etc/passwd
97 root:0
98 bin:1
99 daemon:2
100 mail:8
101 ...
102 .CE
103 .LP
104 (The values to the command line switches may be appended directly
105 to them or separated by whitespace.)
106 .PP
107 The field mode is suited for simple tabular data, like the
108 password file. Beyond that, it soon reaches its limits. The typical
109 case of whitespace-separated fields, in particular, is covered
110 poorly by it. Cut's delimiter is exactly one character,
111 therefore one cannot split at both space and tab characters.
112 Furthermore, multiple adjacent delimiter characters lead to
113 empty fields. This is not the expected behavior for
114 the processing of whitespace-separated fields. Some
115 implementations, e.g. the one of FreeBSD, have extensions that
116 handle this case in the expected way. On other systems or
117 to stay portable, awk comes to the rescue.
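.PP
The difference can be seen in a minimal sketch. The input line is
hypothetical: its fields are separated by two spaces and by a tab.
Cut, told to split at the space character, reports an empty second
field, whereas awk splits at runs of spaces and tabs by default.

```shell
# Hypothetical input: fields separated by two spaces and by a tab.
input=$(printf 'alpha  beta\tgamma')

# cut with a single space as delimiter: the second field is the
# empty string between the two adjacent spaces.
cut_field=$(printf '%s\n' "$input" | cut -d ' ' -f 2)

# awk splits at runs of spaces and tabs by default.
awk_field=$(printf '%s\n' "$input" | awk '{ print $2 }')

printf 'cut: "%s"\n' "$cut_field"    # cut: ""
printf 'awk: "%s"\n' "$awk_field"    # awk: "beta"
```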
118 .PP
119 Awk provides another functionality that cut lacks: Changing the order
120 of the fields in the output. For cut, the order of the field
121 selection specification is irrelevant; it doesn't even matter if
122 fields occur multiple times. Thus, the invocation
123 \f(CWcut -c 5-8,1,4-6\fP outputs the characters number
124 1, 4, 5, 6, 7, and 8 in exactly this order. The
125 selection specification resembles mathematical set theory: Each
126 specified field is part of the solution set. The fields in the
127 solution set are always in the same order as in the input. To
128 speak with the words of the man page in Version 8 Unix:
129 ``In data base parlance, it projects a relation.''
130 .[[ http://man.cat-v.org/unix_8th/1/cut
131 This means that cut applies the \fIprojection\fP database operation
132 to the text input. Wikipedia explains it in the following way:
133 ``In practical terms, it can be roughly thought of as picking a
134 sub-set of all available columns.''
135 .[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
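.PP
A short sketch of the ordering behavior, using a hypothetical
passwd-style record: cut emits the selected fields in input order,
no matter how the list is written, while awk prints them in the
order requested.

```shell
# Hypothetical passwd-style record, used only for illustration.
record='root:x:0:0:root:/root:/bin/sh'

# cut ignores the order of the list; fields appear in input order.
cut_out=$(printf '%s\n' "$record" | cut -d: -f3,1)

# awk prints the fields in exactly the requested order.
awk_out=$(printf '%s\n' "$record" | awk -F: '{ print $3 ":" $1 }')

printf '%s\n' "$cut_out"    # root:0
printf '%s\n' "$awk_out"    # 0:root
```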
137 .SH
138 Historical Background
139 .LP
140 Cut came to public life in 1982 with the release of UNIX System
141 III. Browsing through the sources of System III, one finds cut.c
142 with the timestamp 1980-04-11
143 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .
144 This is the oldest implementation of the program I was able to
145 discover. However, the SCCS-ID in the source code contains the
146 version number 1.5. According to Doug McIlroy
147 .[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,
148 the earlier history likely lies in PWB/UNIX, which was the
149 basis for System III. In the available sources of PWB 1.0 (1977)
150 .[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,
151 no cut is present. Of PWB 2.0, no sources or useful documentation
152 seem to be available. PWB 3.0 was later renamed to System III
153 for marketing purposes only; the two are otherwise identical. A
154 branch of PWB was CB UNIX, which was used only internally at
155 Bell Labs. The manual of CB UNIX Edition 2.1 of November 1979
156 contains the earliest mention of cut that my research brought
157 to light, in the form of a man page
158 .[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .
159 .PP
160 A look at BSD: There, my earliest discovery is a cut.c with
161 the file modification date of 1986-11-07
162 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut
163 as part of the special version 4.3BSD-UWisc
164 .[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,
165 which was released in January 1987.
166 This implementation is mostly identical to the one in System
167 III. The better known 4.3BSD-Tahoe (1988) does not contain cut.
168 The subsequent 4.3BSD-Reno (1990) does include cut. It is a freshly
169 written one by Adam S. Moskowitz and Marciano Pitargue, which was
170 included in BSD in 1989
171 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .
172 Its man page
173 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1
174 already mentions the expected compliance to POSIX.2.
175 One should note that POSIX.2 was first published in
176 September 1992, about two years after the man page and the
177 program were written. Hence, the program must have been
178 implemented based on a draft version of the standard. A look into
179 the code confirms the assumption. The function to parse the field
180 selection includes the following comment:
181 .QP
182 This parser is less restrictive than the Draft 9 POSIX spec.
183 POSIX doesn't allow lists that aren't in increasing order or
184 overlapping lists.
185 .LP
186 Draft 11.2 of POSIX (1991-09) requires this flexibility already:
187 .QP
188 The elements in list can be repeated, can overlap, and can
189 be specified in any order.
190 .LP
191 The same draft additionally includes all three operation modes,
192 whereas this early BSD cut only implemented the original two.
193 Draft 9 might not have included the byte mode. Without access to
194 Draft 9 or 10, it wasn't possible to verify this guess.
195 .PP
196 The version numbers and change dates of the older BSD
197 implementations are manifested in the SCCS-IDs, which the
198 version control system of that time inserted. For instance
199 in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.
200 .PP
201 The cut implementation of the GNU coreutils contains the
202 following copyright notice:
203 .CS
204 Copyright (C) 1997-2015 Free Software Foundation, Inc.
205 Copyright (C) 1984 David M. Ihnat
206 .CE
207 .LP
208 This code does have old origins. Further comments show that
209 the source code was reworked by David MacKenzie first and later
210 by Jim Meyering, who put it into the version control system in
211 1992. It is unclear why the years until 1997, at least from
212 1992 onwards, don't show up in the copyright notice.
213 .PP
214 Despite all those year numbers from the 80s, cut is a rather
215 young tool, at least in relation to early Unix. Although cut
216 is a decade older than Linux (the kernel), Unix had already
217 existed for over ten years by the time cut first appeared.
218 Most notably, cut wasn't part of Version 7 Unix, which
219 became the basis for all modern Unix systems. The more complex
220 tools sed and awk were part of it already. Hence, the
221 question comes to mind why cut was written at all, as two
222 programs already existed that were able to cover its use
223 cases. One reason for cut surely was its compactness and the
224 resulting speed, in comparison to the then-bulky awk. This lean
225 shape goes well with the Unix philosophy: Do one job and do it
226 well! Cut was sufficiently convincing. It found its way to
227 other Unix variants, it became standardized, and today it is
228 present everywhere.
229 .PP
230 The original variant (without \f(CW-b\fP) was described already
231 in 1985, by the System V Interface Definition, an important
232 formal description of UNIX System V. In the following years, it
233 appeared in all relevant standards. POSIX.2 specified cut for
234 the first time in its modern form (with \f(CW-b\fP) in 1992.
236 .SH
237 Multi-byte support
238 .LP
239 The byte mode and thus the multi-byte support of the POSIX
240 character mode have been standardized since 1992. But are
241 they present in the available implementations? Which versions
242 implement POSIX correctly?
243 .PP
244 The implementations fall into three groups: There are historic
245 implementations, which have only \f(CW-c\fP and \f(CW-f\fP.
246 Then there are implementations that have \f(CW-b\fP, but
247 treat it as an alias for \f(CW-c\fP only. These
248 implementations work correctly for single-byte encodings
249 (e.g. US-ASCII, Latin1) but for multi-byte encodings (e.g.
250 UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and
251 \f(CW-n\fP is ignored). Finally, there are implementations
252 that implement \f(CW-c\fP and \f(CW-b\fP in a POSIX-compliant
253 way.
254 .PP
255 Historic two-mode implementations are the ones of
256 System III, System V, and the BSD ones until the mid-90s.
257 .PP
258 Pseudo multi-byte implementations are provided by GNU,
259 modern NetBSD, and modern OpenBSD. The level of POSIX compliance
260 that is presented there is often higher than the level of
261 compliance that is actually provided. Sometimes it takes a
262 close look to discover that \f(CW-c\fP and \f(CW-n\fP don't
263 behave as expected. Some of the implementations take the
264 easy way by simply ignoring multi-byte encodings altogether;
265 at least they declare that clearly:
266 .QP
267 Since we don't support multi-byte characters, the \f(CW-c\fP
268 and \f(CW-b\fP options are equivalent, and the \f(CW-n\fP
269 option is meaningless.
270 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
271 .LP
272 Standard-adhering implementations, i.e. ones that treat
273 multi-byte characters correctly, are those of the modern
274 FreeBSD and the Heirloom toolchest. Tim Robbins
275 reimplemented the character mode of FreeBSD cut,
276 conforming to POSIX, in the summer of 2004
277 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
278 The question why the other BSD systems have not
279 integrated this change is an open one. Maybe the answer can be
280 found in the above quoted statement.
281 .PP
282 How do users find out if the cut on their own system handles
283 multi-byte characters correctly? First, one needs to check if
284 the system itself uses multi-byte characters, because otherwise
285 characters and bytes are equivalent and the question
286 is irrelevant. One can check this by looking at the locale
287 settings, but it is easier to print a typical multi-byte
288 character, for instance an Umlaut or the Euro currency
289 symbol, and check if one or more bytes are generated as
290 output:
291 .CS
292 $ echo ä | od -c
293 0000000 303 244 \\n
294 0000003
295 .CE
296 .LP
297 In this case it resulted in two bytes: octal 303 and 244. (The
298 newline character is added by echo.)
299 .PP
300 The program iconv converts text to specific encodings. This
301 is the output for Latin1 and UTF-8, for comparison:
302 .CS
303 $ echo ä | iconv -t latin1 | od -c
304 0000000 344 \\n
305 0000002
306 .sp .3
307 $ echo ä | iconv -t utf8 | od -c
308 0000000 303 244 \\n
309 0000003
310 .CE
311 .LP
312 The output (without the iconv conversion) on many European
313 systems equals one of these two.
314 .PP
315 Now for the test of the cut implementation. On a UTF-8 system, a
316 POSIX-compliant implementation behaves as follows:
317 .CS
318 $ echo ä | cut -c 1 | od -c
319 0000000 303 244 \\n
320 0000003
321 .sp .3
322 $ echo ä | cut -b 1 | od -c
323 0000000 303 \\n
324 0000002
325 .sp .3
326 $ echo ä | cut -b 1 -n | od -c
327 0000000 \\n
328 0000001
329 .CE
330 .LP
331 A pseudo-POSIX implementation, in contrast, behaves as in the
332 middle case for all three invocations: Only the first byte is
333 printed as output.
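.PP
The first of these checks can be wrapped into a small function, as
sketched here. The function name \f(CWclassify_cut\fP and the labels
it prints are made up for this illustration, and the sketch assumes
that the current locale uses UTF-8; in a single-byte locale every
implementation returns just one byte.

```shell
# Classify the local cut by the number of bytes that "cut -c 1"
# returns for a two-byte UTF-8 character (a-umlaut, octal 303 244).
# Assumption: the current locale uses UTF-8.
classify_cut() {
    # wc -c counts the trailing newline too, hence 3 or 2 bytes.
    nbytes=$(( $(printf '\303\244\n' | cut -c 1 | wc -c) ))
    case $nbytes in
    3) echo "character-aware" ;;   # whole character was kept
    2) echo "byte-oriented" ;;     # only the first byte was kept
    *) echo "unexpected" ;;
    esac
}

classify_cut
```
.LP
Which label appears depends on both the cut implementation and the
locale settings of the system at hand.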
335 .SH
336 Implementations
337 .LP
338 Let's take a look at the sources of a selection of
339 implementations.
340 .PP
341 A comparison of the amount of source code is a good way to get a first
342 impression. Typically, it grows over time. This can generally
343 be seen here, but not in all cases. A POSIX-compliant
344 implementation of the character mode requires more code, thus
345 these implementations tend to be the larger ones.
346 .TS
347 center;
348 r r r l l l.
349 SLOC Lines Bytes Belongs to File time Category
350 _
351 116 123 2966 System III 1980-04-11 historic
352 118 125 3038 4.3BSD-UWisc 1986-11-07 historic
353 200 256 5715 4.3BSD-Reno 1990-06-25 historic
354 200 270 6545 NetBSD 1993-03-21 historic
355 218 290 6892 OpenBSD 2008-06-27 pseudo-POSIX
356 224 296 6920 FreeBSD 1994-05-27 historic
357 232 306 7500 NetBSD 2014-02-03 pseudo-POSIX
358 340 405 7423 Heirloom 2012-05-20 POSIX
359 382 586 14175 GNU coreutils 1992-11-08 pseudo-POSIX
360 391 479 10961 FreeBSD 2012-11-24 POSIX
361 588 830 23167 GNU coreutils 2015-05-01 pseudo-POSIX
362 .TE
363 .LP
364 There are four rough groups: (1) The two original
365 implementations, which are mostly identical, with about 100
366 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
367 two POSIX-compliant versions and the old GNU one, with a SLOC
368 count in the 300s. And finally, (4) the modern GNU cut with
369 almost 600 SLOC.
370 .PP
371 The ratio between the number of newlines in the file
372 (\f(CWwc -l\fP) and the number of logical code lines (SLOC,
373 measured with SLOCcount) ranges from
374 1.06 for the oldest versions to 1.5 for GNU. The
375 largest influence on it are empty lines, pure comment lines,
376 and the size of the license block at the beginning of the file.
377 .PP
378 Regarding the ratio between the file size (\f(CWwc -c\fP) and
379 the logical code lines, the implementations typically lie
380 between 25 and 30 bytes per code line. With only 21 bytes per
381 code line, the Heirloom implementation marks the lower end;
382 the GNU implementation sets the upper limit at nearly 40 bytes. In
383 the case of GNU, the reason is mainly their coding style, with
384 special indentation rules and long identifiers. Whether one finds
385 the Heirloom implementation
386 .[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup
387 highly cryptic or exceptionally elegant shall be left
388 to the judgement of the reader. Especially the
389 comparison to the GNU implementation
390 .[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643
391 is impressive.
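.PP
The two ratios can be recomputed from the table above with a few
lines of awk. This sketch copies four rows from the table; the
hyphenated names are merely shortened labels, chosen so that each
row splits cleanly into four whitespace-separated fields.

```shell
# Columns: SLOC, lines (wc -l), bytes (wc -c), label.
# The figures are copied from the table; the labels are shortened.
ratios=$(printf '%s\n' \
    '116 123 2966 System-III' \
    '340 405 7423 Heirloom' \
    '391 479 10961 FreeBSD-2012' \
    '588 830 23167 GNU-2015' |
    awk '{ printf "%s %.2f %.1f\n", $4, $2 / $1, $3 / $1 }')

# Prints: label, lines per SLOC, bytes per SLOC.
printf '%s\n' "$ratios"
```
.LP
For System III, for instance, this gives 1.06 lines and 25.6 bytes
per SLOC, and for the modern GNU cut 1.41 lines and 39.4 bytes.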
392 .PP
393 The internal structure of the source code (in all cases it is
394 written in C) is mainly similar. Besides the mandatory main
395 function, which does the command line argument processing,
396 there usually is a function to convert the field
397 selection specification to an internal data structure.
398 Furthermore, almost all implementations have separate
399 functions for each of their operation modes. The POSIX-compliant
400 versions treat the \f(CW-b -n\fP combination as a separate
401 mode and thus implement it in a separate function. Only the early
402 System III implementation (and its 4.3BSD-UWisc variant) do
403 everything, apart from error handling, in the main function.
404 .PP
405 Implementations of cut typically have two limiting aspects:
406 One being the maximum number of fields that can be handled,
407 the other being the maximum line length. On System III, both
408 numbers are limited to 512. 4.3BSD-Reno and the BSDs of the
409 90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or
410 \f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, modern NetBSD, all GNU
411 implementations, and the Heirloom cut are able to handle
412 arbitrary numbers of fields and line lengths \(en the memory
413 is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed
414 maximum number of fields, but allows arbitrary line lengths.
415 The limited number of fields does not, however, appear to be
416 any practical problem, because \f(CW_POSIX2_LINE_MAX\fP is
417 guaranteed to be at least 2048 and is thus probably large enough.
419 .SH
420 Descriptions
421 .LP
422 Interesting, as well, is a comparison of the short descriptions
423 of cut, as can be found in the headlines of the man
424 pages or at the beginning of the source code files.
425 The following list is roughly grouped by origin:
426 .TS
427 center;
428 l l.
429 CB UNIX cut out selected fields of each line of a file
430 System III cut out selected fields of each line of a file
431 System III \(dg cut and paste columns of a table (projection of a relation)
432 System V cut out selected fields of each line of a file
433 HP-UX cut out (extract) selected fields of each line of a file
434 .sp .3
435 4.3BSD-UWisc \(dg cut and paste columns of a table (projection of a relation)
436 4.3BSD-Reno select portions of each line of a file
437 NetBSD select portions of each line of a file
438 OpenBSD 4.6 select portions of each line of a file
439 FreeBSD 1.0 select portions of each line of a file
440 FreeBSD 10.0 cut out selected portions of each line of a file
441 SunOS 4.1.3 remove selected fields from each line of a file
442 SunOS 5.5.1 cut out selected fields of each line of a file
443 .sp .3
444 Heirloom Tools cut out selected fields of each line of a file
445 Heirloom Tools \(dg cut out fields of lines of files
446 .sp .3
447 GNU coreutils remove sections from each line of files
448 .sp .3
449 Minix select out columns of a file
450 .sp .3
451 Version 8 Unix rearrange columns of data
452 ``Unix Reader'' rearrange columns of text
453 .sp .3
454 POSIX cut out selected fields of each line of a file
455 .TE
456 .LP
457 (The descriptions that are marked with `\(dg' were taken from
458 source code files. The POSIX entry contains the description
459 used in the standard. The ``Unix Reader'' is a retrospective
460 document by Doug McIlroy, which lists the availability of
461 tools in the Research Unix versions
462 .[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .
463 Its description should actually match the one in Version 8
464 Unix. The change could be a transfer mistake or a correction.
465 All other descriptions originate from the various man pages.)
466 .PP
467 Over time, the POSIX description was often adopted or it
468 served as inspiration. One such example is FreeBSD
469 .[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .
470 .PP
471 It is noteworthy that the GNU coreutils in all versions
472 describe the performed action as a removal of parts of the
473 input, although the user clearly selects the parts that then
474 constitute the output. Probably the words ``cut out'' are too
475 misleading. HP-UX tried to be clearer.
476 .PP
477 Different terms are also used for the part being
478 selected. Some talk about fields (POSIX), some talk
479 about portions (BSD), and some call them columns (Research
480 Unix).
481 .PP
482 The seemingly least adequate description, the one of Version
483 8 Unix (``rearrange columns of data''), can be explained
484 insofar as the man page covers both cut and paste, and in
485 their combination, columns can be rearranged. The use of
486 ``data'' instead of ``text'' might be a lapse, which McIlroy
487 corrected in his Unix Reader ... but on the other hand, on
488 Unix, the two words are mostly synonymous, because all data
489 is text.
492 .SH
493 References
494 .LP
495 .nf
496 ._r