view cut.en.ms @ 38:ec76f8926598

clarify a statement Thanks to Francesc for the suggestion.
author markus schnalke <meillo@marmaro.de>
date Tue, 06 Oct 2015 10:43:26 +0200
1 .so macros
2 .lc_ctype en_US.utf8
3 .pl -3v
5 .TL
6 Cut out selected fields of each line of a file
7 .AU
8 markus schnalke <meillo@marmaro.de>
9 ..
10 .FS
11 2015-05.
12 This text is part of the public domain (CC0).
13 It is available online:
14 .I http://marmaro.de/docs/
15 .FE
17 .LP
18 Cut is a classic program in the Unix toolchest.
19 It is present in most tutorials on shell programming, because it
20 is such a nice and useful tool with good explanatory value.
21 This text shall take a look underneath its surface.
22 .SH
23 Usage
24 .LP
25 Initially, cut had two operation modes, which were later amended
26 by a third: The cut program may cut specified characters or bytes
27 out of the input lines or it may cut out specified fields, which
28 are defined by a delimiting character.
29 .PP
30 The character mode is well suited to slice fixed-width input
31 formats into parts. One might, for instance, extract the access
32 rights from the output of \f(CWls -l\fP, as shown here with the
33 rights of a file's owner:
34 .CS
35 $ ls -l foo
36 -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo
37 .sp .3
38 $ ls -l foo | cut -c 2-4
39 rw-
40 .CE
41 .LP
42 Or the write permission for the owner, the group, and the
43 world:
44 .CS
45 $ ls -l foo | cut -c 3,6,9
46 ww-
47 .CE
48 .LP
49 Cut can also be used to shorten strings:
50 .CS
51 $ long=12345678901234567890
52 .sp .3
53 $ echo "$long" | cut -c -10
54 1234567890
55 .CE
56 .LP
57 This command outputs no more than the first 10 characters of
58 \f(CW$long\fP. (Alternatively, one could use \f(CWprintf
59 "%.10s\\n" "$long"\fP for this task.)
60 .PP
61 However, if it's not about displaying characters, but rather about
62 storing them, then \f(CW-c\fP is only partly suited. In former times,
63 when US-ASCII was the omnipresent character encoding, each
64 character was stored as exactly one byte. Therefore, \f(CWcut
65 -c\fP selected both output characters and bytes equally. With
66 the rise of multi-byte encodings (like UTF-8), this assumption
67 became obsolete. Consequently, a byte mode (option \f(CW-b\fP)
68 was added to cut, with POSIX.2-1992. To select up to 500 bytes
69 from the beginning of each line (and ignore the rest), one can use:
70 .CS
71 $ cut -b -500
72 .CE
73 .LP
74 The remainder can be caught with \f(CWcut -b 501-\fP. This
75 use of cut is important for POSIX, because it provides a
76 transformation of text files with arbitrary line lengths to text
77 files with limited line length
78 .[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .
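.LP
As a minimal sketch (the sample string and the width of 5 bytes
are arbitrary assumptions for illustration), the two invocations
split a line into a head and a remainder:
.CS
line=abcdefgh
echo "$line" | cut -b -5     # prints: abcde
echo "$line" | cut -b 6-     # prints: fgh
.CE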
79 .PP
80 The new byte mode essentially offered the same
81 functionality as the old character mode. The character mode,
82 however, required a new, different implementation. In consequence,
83 the problem was not the support of the byte mode, but rather the
84 correct support of the new character mode.
85 .PP
86 Besides the character and byte modes, cut also offers a field
87 mode, which is activated by \f(CW-f\fP. It selects fields from
88 the input. The field-delimiter character for the input as well
89 as for the output (by default the tab) may be changed using
90 \f(CW-d\fP.
91 .PP
92 The typical example for the use of cut's field mode is the
93 selection of information from the password file. Here, for
94 instance, the usernames and their uids:
95 .CS
96 $ cut -d: -f1,3 /etc/passwd
97 root:0
98 bin:1
99 daemon:2
100 mail:8
101 ...
102 .CE
103 .LP
104 (The values to the command line switches may be appended directly
105 to them or separated by white\%space.)
106 .PP
107 The field mode is suited for simple tabular data, like the
108 password file. Beyond that, it soon reaches its limits. The typical
109 case of whitespace-separated fields, in particular, is covered
110 poorly by it. Cut's delimiter is exactly one character;
111 therefore one cannot split at both space and tab characters.
112 Furthermore, multiple adjacent delimiter characters lead to
113 empty fields. This is not the expected behavior for
114 the processing of whitespace-separated fields. Some
115 implementations, e.g. the one of FreeBSD, have extensions that
116 handle this case in the expected way. On other systems or
117 to stay portable, awk comes to the rescue.
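.LP
The difference can be sketched on a sample line with two adjacent
spaces (the input string is made up for this illustration):
.CS
line="a  b c"
echo "$line" | cut -d " " -f 2     # prints an empty line
echo "$line" | awk '{ print $2 }'  # prints: b
.CE
.LP
For cut, the second field is the empty string between the two
spaces, whereas awk by default treats any run of blanks as one
separator.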
118 .PP
119 Awk provides another functionality that cut lacks: Changing the order
120 of the fields in the output. For cut, the order of the field
121 selection specification is irrelevant; it doesn't even matter if
122 fields occur multiple times. Thus, the invocation
123 \f(CWcut -c 5-8,1,4-6\fP outputs the characters number
124 1, 4, 5, 6, 7, and 8 in ascending order. The
125 selection specification resembles mathematical set theory: Each
126 specified field is part of the solution set. The fields in the
127 solution set are always in the same order as in the input. In
128 the words of the man page in Version 8 Unix:
129 ``In data base parlance, it projects a relation.''
130 .[[ http://man.cat-v.org/unix_8th/1/cut
131 This means that cut applies the \fIprojection\fP database operation
132 to the text input. Wikipedia explains it in the following way:
133 ``In practical terms, it can be roughly thought of as picking a
134 sub-set of all available columns.''
135 .[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
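.LP
A small sketch of this set-like behavior, with made-up sample
input; the awk invocation is only one possible way to reorder
fields:
.CS
echo abcdefghij | cut -c 5-8,1,4-6          # prints: adefgh
echo a:b:c | cut -d: -f 3,1                 # prints: a:c
echo a:b:c | awk -F: '{ print $3 ":" $1 }'  # prints: c:a
.CE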
137 .SH
138 Historical Background
139 .LP
140 Cut came to public life in 1982 with the release of UNIX System
141 III. Browsing through the sources of System III, one finds cut.c
142 with the timestamp 1980-04-11
143 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .
144 This is the oldest implementation of the program I was able to
145 discover. However, the SCCS-ID in the source code contains the
146 version number 1.5. According to Doug McIlroy
147 .[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,
148 the earlier history likely lies in PWB/UNIX, which was the
149 basis for System III. In the available sources of PWB 1.0 (1977)
150 .[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,
151 no cut is present. Of PWB 2.0, no sources or useful documentation
152 seem to be available. PWB 3.0 was later renamed to System III
153 for marketing purposes only; it is otherwise identical to it. A
154 branch of PWB was CB UNIX, which was used only internally at Bell
155 Labs. The manual of CB UNIX Edition 2.1 of November 1979
156 contains the earliest mention of cut that my research brought
157 to light, in the form of a man page
158 .[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .
159 .PP
160 A look at BSD: There, my earliest discovery is a cut.c with
161 the file modification date of 1986-11-07
162 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut
163 as part of the special version 4.3BSD-UWisc
164 .[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,
165 which was released in January 1987.
166 This implementation is mostly identical to the one in System
167 III. The better known 4.3BSD-Tahoe (1988) does not contain cut.
168 The subsequent 4.3BSD-Reno (1990) does include cut. It is a freshly
169 written one by Adam S. Moskowitz and Marciano Pitargue, which was
170 included in BSD in 1989
171 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .
172 Its man page
173 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1
174 already mentions the expected compliance with POSIX.2.
175 One should note that POSIX.2 was first published in
176 September 1992, about two years after the man page and the
177 program were written. Hence, the program must have been
178 implemented based on a draft version of the standard. A look into
179 the code confirms the assumption. The function to parse the field
180 selection includes the following comment:
181 .QP
182 This parser is less restrictive than the Draft 9 POSIX spec.
183 POSIX doesn't allow lists that aren't in increasing order or
184 overlapping lists.
185 .LP
186 Draft 11.2 of POSIX (1991-09) requires this flexibility already:
187 .QP
188 The elements in list can be repeated, can overlap, and can
189 be specified in any order.
190 .LP
191 The same draft additionally includes all three operation modes,
192 whereas this early BSD cut only implemented the original two.
193 Draft 9 might not have included the byte mode. Without access to
194 Draft 9 or 10, it wasn't possible to verify this guess.
195 .PP
196 The version numbers and change dates of the older BSD
197 implementations are recorded in the SCCS-IDs, which the
198 version control system of that time inserted. For instance
199 in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.
200 .PP
201 The cut implementation of the GNU coreutils contains the
202 following copyright notice:
203 .CS
204 Copyright (C) 1997-2015 Free Software Foundation, Inc.
205 Copyright (C) 1984 David M. Ihnat
206 .CE
207 .LP
208 This code does have old origins. Further comments show that
209 the source code was reworked by David MacKenzie first and later
210 by Jim Meyering, who put it into the version control system in
211 1992. It is unclear why the years until 1997, at least from
212 1992 onwards, don't show up in the copyright notice.
213 .PP
214 Despite all those dates from the 80s, cut is a rather
215 young tool, at least in relation to the early Unix. Although
216 cut is a decade older than Linux (the kernel), Unix had existed
217 for over ten years already by the time cut first appeared.
218 Most notably, cut wasn't part of Version 7 Unix, which
219 became the basis for all modern Unix systems. The more complex
220 tools sed and awk were part of it already. Hence, the
221 question comes to mind why cut was written at all, as two
222 programs already existed that were able to cover its use
223 cases. One reason for cut surely was its compactness and the
224 resulting speed, in comparison to the then-bulky awk. This lean
225 shape goes well with the Unix philosophy: Do one job and do it
226 well! Cut was sufficiently convincing. It found its way to
227 other Unix variants, it became standardized, and today it is
228 present everywhere.
229 .PP
230 The original variant (without \f(CW-b\fP) was described already
231 in 1985, by the System V Interface Definition, an important
232 formal description of UNIX System V. In the following years, it
233 appeared in all relevant standards. POSIX.2 specified cut for
234 the first time in its modern form (with \f(CW-b\fP) in 1992.
236 .pl -1v
237 .SH
238 Multi-byte support
239 .LP
240 The byte mode and thus the multi-byte support of the POSIX
241 character mode have been standardized since 1992. But are
242 they present in the available implementations? Which versions
243 implement POSIX correctly?
244 .PP
245 The situation is divided into three parts: There are historic
246 implementations, which have only \f(CW-c\fP and \f(CW-f\fP.
247 Then there are implementations that have \f(CW-b\fP, but
248 treat it as an alias for \f(CW-c\fP only. These
249 implementations work correctly for single-byte encodings
250 (e.g. US-ASCII, Latin1) but for multi-byte en\%codings (e.g.
251 UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and
252 \f(CW-n\fP is ignored). Finally, there are implementations
253 that implement \f(CW-c\fP and \f(CW-b\fP in a POSIX-compliant
254 way.
255 .PP
256 Historic two-mode implementations are the ones of
257 System III, System V, and the BSD ones until the mid-90s.
258 .PP
259 Pseudo multi-byte implementations are provided by GNU,
260 modern NetBSD, and modern OpenBSD. The level of POSIX compliance
261 that is presented there is often higher than the level of
262 compliance that is actually provided. Sometimes it takes a
263 close look to discover that \f(CW-c\fP and \f(CW-n\fP don't
264 behave as expected. Some of the implementations take the
265 easy way out by simply ignoring multi-byte
266 encodings; at least they declare that clearly:
267 .QP
268 Since we don't support multi-byte characters, the \f(CW-c\fP
269 and \f(CW-b\fP options are equivalent, and the \f(CW-n\fP
270 option is meaningless.
271 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
272 .LP
273 Standard-adhering implementations, i.e. ones that treat
274 multi-byte characters correctly, are those of the modern
275 FreeBSD and the Heirloom toolchest. Tim Robbins
276 reimplemented the character mode of FreeBSD cut,
277 conforming to POSIX, in the summer of 2004
278 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
279 The question why the other BSD systems have not
280 integrated this change is an open one. Maybe the answer is
281 a general ignorance of internationalization.
282 .PP
283 How do users find out if the cut on their own system handles
284 multi-byte characters correctly? First, one needs to check if
285 the system itself uses multi-byte characters, because otherwise
286 characters and bytes are equivalent and the question
287 is irrelevant. One can check this by looking at the locale
288 settings, but it is easier to print a typical multi-byte
289 character, for instance an umlaut or the Euro currency
290 symbol, and check if one or more bytes are generated as
291 output:
292 .CS
293 $ echo ä | od -c
294 0000000 303 244 \\n
295 0000003
296 .CE
297 .LP
298 In this case it resulted in two bytes: octal 303 and 244. (The
299 newline character is added by echo.)
300 .PP
301 The program iconv converts text to specific encodings. This
302 is the output for Latin1 and UTF-8, for comparison:
303 .CS
304 $ echo ä | iconv -t latin1 | od -c
305 0000000 344 \\n
306 0000002
307 .sp .3
308 $ echo ä | iconv -t utf8 | od -c
309 0000000 303 244 \\n
310 0000003
311 .CE
312 .LP
313 The output (without the iconv conversion) on many European
314 systems equals one of these two.
315 .PP
316 Now for the test of the cut implementation. On a UTF-8 system, a
317 POSIX-compliant implementation behaves as follows:
318 .CS
319 $ echo ä | cut -c 1 | od -c
320 0000000 303 244 \\n
321 0000003
322 .sp .3
323 $ echo ä | cut -b 1 | od -c
324 0000000 303 \\n
325 0000002
326 .sp .3
327 $ echo ä | cut -b 1 -n | od -c
328 0000000 \\n
329 0000001
330 .CE
331 .LP
332 A pseudo-POSIX implementation, in contrast, behaves like the
333 middle one for all three invocations: Only the first byte is
334 printed as output.
336 .SH
337 Implementations
338 .LP
339 Let's take a look at the sources of a selection of
340 implementations.
341 .PP
342 A comparison of the amount of source code is good to get a first
343 impression. Typically, it grows through time. This can generally
344 be seen here, but not in all cases. A POSIX-compliant
345 implementation of the character mode requires more code, thus
346 these implementations tend to be the larger ones.
347 .TS
348 center;
349 r r r l l l.
350 SLOC	Lines	Bytes	Belongs to	File time	Category
351 _
352 116	123	2966	System III	1980-04-11	historic
353 118	125	3038	4.3BSD-UWisc	1986-11-07	historic
354 200	256	5715	4.3BSD-Reno	1990-06-25	historic
355 200	270	6545	NetBSD	1993-03-21	historic
356 218	290	6892	OpenBSD	2008-06-27	pseudo-POSIX
357 224	296	6920	FreeBSD	1994-05-27	historic
358 232	306	7500	NetBSD	2014-02-03	pseudo-POSIX
359 340	405	7423	Heirloom	2012-05-20	POSIX
360 382	586	14175	GNU coreutils	1992-11-08	pseudo-POSIX
361 391	479	10961	FreeBSD	2012-11-24	POSIX
362 588	830	23167	GNU coreutils	2015-05-01	pseudo-POSIX
363 .TE
364 .LP
365 There are four rough groups: (1) The two original
366 implementations, which are mostly identical, with about 100
367 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
368 two POSIX-compliant versions and the old GNU one, with a SLOC
369 count in the 300s. And finally, (4) the modern GNU cut with
370 almost 600 SLOC.
371 .PP
372 The variation between the number of logical code
373 lines (SLOC, measured with SLOCcount) and the number of
374 newlines in the file (\f(CWwc -l\fP) spans between factor
375 1.06 for the oldest versions and factor 1.5 for GNU. The
376 largest influences on it are empty lines, pure comment lines,
377 and the size of the license block at the beginning of the file.
378 .PP
379 Regarding the variation between logical code lines and the
380 file size (\f(CWwc -c\fP), the implementations span between
381 25 and 30 bytes per statement. With only 21 bytes per
382 statement, the Heirloom implementation marks the lower end;
383 the GNU implementation sets the upper limit at nearly 40 bytes. In
384 the case of GNU, the reason is mainly their coding style, with
385 special indentation rules and long identifiers. Whether one finds
386 the Heirloom implementation
387 .[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup
388 highly cryptic or exceptionally elegant shall be left
389 to the judgement of the reader. Especially the
390 comparison to the GNU implementation
391 .[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643
392 is impressive.
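.LP
The ratios can be recomputed from the table. The following sketch
assumes awk and uses four of the table's rows, with the names
shortened to single words for easier parsing; it prints the name,
the lines per SLOC, and the bytes per SLOC:
.CS
awk '{ print $4, sprintf("%.2f", $2/$1), sprintf("%.1f", $3/$1) }' <<EOF
116 123 2966 SystemIII
340 405 7423 Heirloom
382 586 14175 GNU-1992
588 830 23167 GNU-2015
EOF
.CE
.LP
This yields 1.06 and 25.6 for System III, 1.19 and 21.8 for
Heirloom, 1.53 and 37.1 for the old GNU version, and 1.41 and
39.4 for the modern one.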
393 .PP
394 The internal structure of the source code (in all cases it is
395 written in C) is mainly similar. Besides the mandatory main
396 function, which does the command line argument processing,
397 there usually is a function to convert the field
398 selection specification to an internal data structure.
399 Furthermore, almost all implementations have separate
400 functions for each of their operation modes. The POSIX-compliant
401 versions treat the \f(CW-b -n\fP combination as a separate
402 mode and thus implement it in a separate function. Only the early
403 System III implementation (and its 4.3BSD-UWisc variant) does
404 everything, apart from error handling, in the main function.
405 .PP
406 Implementations of cut typically have two limiting aspects:
407 One being the maximum number of fields that can be handled,
408 the other being the maximum line length. On System III, both
409 numbers are limited to 512. 4.3BSD-Reno and the BSDs of the
410 90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or
411 \f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, modern NetBSD, all GNU
412 implementations, and the Heirloom cut are able to handle
413 arbitrary numbers of fields and line lengths \(en the memory
414 is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed
415 maximum number of fields, but allows arbitrary line lengths.
416 The limited number of fields does not, however, appear to be
417 a practical problem, because \f(CW_POSIX2_LINE_MAX\fP is
418 guaranteed to be at least 2048 and is thus probably large enough.
420 .SH
421 Descriptions
422 .LP
423 Interesting, as well, is a comparison of the short descriptions
424 of cut, as can be found in the headlines of the man
425 pages or at the beginning of the source code files.
426 The following list is roughly grouped by origin:
427 .TS
428 center;
429 l l.
430 CB UNIX	cut out selected fields of each line of a file
431 System III	cut out selected fields of each line of a file
432 System III \(dg	cut and paste columns of a table (projection of a relation)
433 System V	cut out selected fields of each line of a file
434 HP-UX	cut out (extract) selected fields of each line of a file
435 .sp .3
436 4.3BSD-UWisc \(dg	cut and paste columns of a table (projection of a relation)
437 4.3BSD-Reno	select portions of each line of a file
438 NetBSD	select portions of each line of a file
439 OpenBSD 4.6	select portions of each line of a file
440 FreeBSD 1.0	select portions of each line of a file
441 FreeBSD 10.0	cut out selected portions of each line of a file
442 SunOS 4.1.3	remove selected fields from each line of a file
443 SunOS 5.5.1	cut out selected fields of each line of a file
444 .sp .3
445 Heirloom Tools	cut out selected fields of each line of a file
446 Heirloom Tools \(dg	cut out fields of lines of files
447 .sp .3
448 GNU coreutils	remove sections from each line of files
449 .sp .3
450 Minix	select out columns of a file
451 .sp .3
452 Version 8 Unix	rearrange columns of data
453 ``Unix Reader''	rearrange columns of text
454 .sp .3
455 POSIX	cut out selected fields of each line of a file
456 .TE
457 .LP
458 (The descriptions that are marked with `\(dg' were taken from
459 source code files. The POSIX entry contains the description
460 used in the standard. The ``Unix Reader'' is a retrospective
461 document by Doug McIlroy, which lists the availability of
462 tools in the Research Unix versions
463 .[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .
464 Its description should actually match the one in Version 8
465 Unix. The change could be a transfer mistake or a correction.
466 All other descriptions originate from the various man pages.)
467 .PP
468 Over time, the POSIX description was often adopted or it
469 served as inspiration. One such example is FreeBSD
470 .[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .
471 .PP
472 It is noteworthy that the GNU coreutils in all versions
473 describe the performed action as a removal of parts of the
474 input, although the user clearly selects the parts that then
475 constitute the output. Probably the words ``cut out'' are too
476 misleading. HP-UX tried to be clearer.
477 .PP
478 Different terms are also used for the part being
479 selected. Some talk about fields (POSIX), some talk
480 about portions (BSD), and some call them columns (Research
481 Unix).
482 .PP
483 The seemingly least adequate description, the one of Version
484 8 Unix (``rearrange columns of data''), can be explained by the
485 fact that the man page covers both cut and paste, and in
486 their combination, columns can be rearranged. The use of
487 ``data'' instead of ``text'' might be a lapse, which McIlroy
488 corrected in his Unix Reader ... but on the other hand, on
489 Unix, the two words are mostly synonymous, because all data
490 is text.
493 .SH
494 References
495 .LP
496 .nf
497 ._r