view cut.en.ms @ 27:5cefcfc72d42

Added first version of the translation to English
author markus schnalke <meillo@marmaro.de>
date Tue, 04 Aug 2015 21:04:10 +0200
1 .so macros
2 .lc_ctype en_US.utf8
3 .pl -4v
5 .TL
6 Cut out selected fields of each line of a file
7 .AU
8 markus schnalke <meillo@marmaro.de>
9 ..
10 .FS
11 2015-05.
12 This text is in the public domain (CC0).
13 It is available online:
14 .I http://marmaro.de/docs/
15 .FE
17 .LP
18 Cut is a classic program in the Unix toolchest.
19 It is present in most tutorials on shell programming, because it
20 is such a nice and useful tool with good explanatory value.
21 This text takes a look beneath its surface.
22 .SH
23 Usage
24 .LP
25 Initially, cut had two operation modes, which were later
26 supplemented by a third. Cut may cut specified characters out of the
27 input lines or it may cut out specified fields, which are defined
28 by a delimiting character.
29 .PP
30 The character mode is well suited to slice fixed-width input
31 formats into parts. One might, for instance, extract the access
32 rights from the output of \f(CWls -l\fP, here the rights of the
33 file's owner:
34 .CS
35 $ ls -l foo
36 -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo
37 .sp .3
38 $ ls -l foo | cut -c 2-4
39 rw-
40 .CE
41 .LP
42 Or the write permission for the owner, the group and the
43 world:
44 .CS
45 $ ls -l foo | cut -c 3,6,9
46 ww-
47 .CE
48 .LP
49 Cut can also be used to shorten strings:
50 .CS
51 $ long=12345678901234567890
52 .sp .3
53 $ echo "$long" | cut -c -10
54 1234567890
55 .CE
56 .LP
57 This command outputs no more than the first 10 characters of
58 \f(CW$long\fP. (Alternatively, one could use \f(CWprintf
59 "%.10s\\n" "$long"\fP for this job.)
60 .PP
61 However, if the concern is not displaying characters but storing
62 them, then \f(CW-c\fP is only partly suited. In former times,
63 when US-ASCII was the omnipresent character encoding, each
64 character was stored in exactly one byte. Therefore, \f(CWcut
65 -c\fP selected output characters and bytes alike. With
66 the rise of multi-byte encodings (like UTF-8), this assumption
67 became obsolete. Consequently, a byte mode (option \f(CW-b\fP)
68 was added to cut with POSIX.2-1992. To select up to the first
69 500 bytes of each line (and ignore the rest), one can use:
70 .CS
71 $ cut -b -500
72 .CE
73 .LP
74 The remainder can be caught with \f(CWcut -b 501-\fP. This
75 possibility is important for POSIX, because it allows the creation
76 of text files with limited line length
77 .[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .
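.PP
As a small illustration of this splitting, the following sketch cuts a short sample line at byte 4 instead of byte 500, so that the result is easy to inspect. The sample text, the helper name \f(CWsample\fP and the split position are assumptions made for this example only.

```shell
# Sample line of ten single-byte characters (illustrative data).
sample() { printf 'abcdefghij\n'; }

# The first four bytes of the line ...
sample | cut -b -4      # prints: abcd

# ... and the remainder, starting at byte 5.
sample | cut -b 5-      # prints: efghij
```

.LP
Taken together, the two parts contain every byte of the original line exactly once.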
78 .PP
79 Although the byte mode was newly introduced, it was meant to
80 behave exactly like the old character mode. The character mode,
81 however, had to be implemented differently. As a consequence,
82 the difficulty was not in supporting the byte mode, but in
83 supporting the new character mode correctly.
84 .PP
85 Besides the character and byte modes, cut has the field mode,
86 which is activated by \f(CW-f\fP. It selects fields from the
87 input. The delimiting character (by default, the tab) may be
88 changed using \f(CW-d\fP. It applies to the input as well as to
89 the output.
90 .PP
91 The typical example for the use of cut's field mode is the
92 selection of information from the passwd file. Here, for
93 instance, the username and its uid:
94 .CS
95 $ cut -d: -f1,3 /etc/passwd
96 root:0
97 bin:1
98 daemon:2
99 mail:8
100 ...
101 .CE
102 .LP
103 (The values to the command line switches may be appended directly
104 to them or separated by whitespace.)
105 .PP
106 The field mode is suited for simple tabular data, like the
107 passwd file. Beyond that, it soon reaches its limits. In particular,
108 the typical case of whitespace-separated fields is covered poorly
109 by it. Cut's delimiter is exactly one character,
110 therefore one cannot split at both space and tab characters.
111 Furthermore, multiple adjacent delimiter characters lead to
112 empty fields. This is not the expected behavior for
113 the processing of whitespace-separated fields. Some
114 implementations, e.g. the one of FreeBSD, have extensions that
115 handle this case in the expected way. Apart from that, i.e.
116 if one wants to stay portable, awk comes to the rescue.
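.PP
The difference can be sketched with a single sample line. The input below (three spaces after the first word, a tab after the second) and the helper name \f(CWsample\fP are assumptions for illustration; the awk call relies only on awk's default field splitting.

```shell
# Sample line: three spaces after "alpha", a tab after "beta".
sample() { printf 'alpha   beta\tgamma\n'; }

# cut splits at every single space, so field 2 is the empty string
# between the first and the second space.
sample | cut -d ' ' -f 2      # prints an empty line

# awk's default field splitting treats any run of spaces and tabs
# as one separator.
sample | awk '{ print $2 }'   # prints: beta
```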
117 .PP
118 Awk provides another feature that cut lacks: changing the order
119 of the fields in the output. For cut, the order of the field
120 selection specification is irrelevant; it doesn't even matter if
121 fields are given multiple times. Thus, the invocation
122 \f(CWcut -c 5-8,1,4-6\fP outputs the characters number
123 1, 4, 5, 6, 7 and 8 in exactly this order. The
124 selection works as in mathematical set theory: Each
125 specified field is part of the solution set. The fields in the
126 solution set are always in the same order as in the input. In
127 the words of the man page in Version 8 Unix:
128 ``In data base parlance, it projects a relation.''
129 .[[ http://man.cat-v.org/unix_8th/1/cut
130 This means that cut applies the database operation \fIprojection\fP
131 to the text input. Wikipedia explains it in the following way:
132 ``In practical terms, it can be roughly thought of as picking a
133 sub-set of all available columns.''
134 .[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
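.PP
The set-like selection, and the reordering that awk offers instead, can be tried out with a few one-liners. The input strings are made-up sample data; the awk variant is only one possible way to reorder fields.

```shell
# The order and the overlap in the list do not matter to cut:
# characters 1, 4, 5, 6, 7 and 8 appear in input order.
echo 12345678 | cut -c 5-8,1,4-6                          # prints: 145678

# The same holds for the field mode: asking for "3,1" still
# yields field 1 before field 3.
printf 'a:b:c\n' | cut -d : -f 3,1                        # prints: a:c

# awk prints the fields in the order in which they are named.
printf 'a:b:c\n' | awk -F : -v OFS=: '{ print $3, $1 }'   # prints: c:a
```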
136 .SH
137 Historical Background
138 .LP
139 Cut came to public life in 1982 with the release of UNIX System
140 III. Browsing through the sources of System III, one finds cut.c
141 with the timestamp 1980-04-11
142 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .
143 This is the oldest implementation of the program that I was able to
144 discover. However, the SCCS-ID in the source code speaks of
145 version 1.5. According to Doug McIlroy
146 .[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,
147 the earlier history likely lies in PWB/UNIX, which was the
148 basis for System III. In the available sources of PWB 1.0 (1977)
149 .[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,
150 no cut is present. Of PWB 2.0, no sources or useful documentation
151 seem to be available. PWB 3.0 was later renamed to System III
152 for marketing purposes, hence it is identical to it. A side line
153 of PWB was CB UNIX, which was used only internally at Bell
154 Labs. The manual of CB UNIX Edition 2.1 of November 1979
155 contains the earliest mention of cut that my research brought
156 to light: a man page for it
157 .[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .
158 .PP
159 Now a look at BSD: There, my earliest discovery is a cut.c with
160 the file modification date of 1986-11-07
161 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut
162 as part of the special version 4.3BSD-UWisc
163 .[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,
164 which was released in January 1987.
165 This implementation is mostly identical to the one in System
166 III. The better known 4.3BSD-Tahoe (1988) does not contain cut.
167 The following 4.3BSD-Reno (1990) does include cut. It is a freshly
168 written one by Adam S. Moskowitz and Marciano Pitargue, which was
169 included in BSD in 1989
170 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .
171 Its man page
172 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1
173 already mentions the expected compliance with POSIX.2.
174 One should note that POSIX.2 was first published in
175 September 1992, about two years after the man page and the
176 program were written. Hence, the program must have been
177 implemented based on a draft version of the standard. A look into
178 the code confirms the assumption. The function to parse the field
179 selection includes the following comment:
180 .QP
181 This parser is less restrictive than the Draft 9 POSIX spec.
182 POSIX doesn't allow lists that aren't in increasing order or
183 overlapping lists.
184 .LP
185 Draft 11.2 of POSIX (1991-09) requires this flexibility already:
186 .QP
187 The elements in list can be repeated, can overlap, and can
188 be specified in any order.
189 .LP
190 The same draft additionally includes all three operation modes,
191 whereas this early BSD cut only implemented the original two.
192 Draft 9 might not have included the byte mode. Without access to
193 Draft 9 or 10, it wasn't possible to verify this guess.
194 .PP
195 The version numbers and change dates of the older BSD
196 implementations are manifested in the SCCS-IDs, which the
197 version control system of that time inserted. For instance
198 in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.
199 .PP
200 The cut implementation of the GNU coreutils contains the
201 following copyright notice:
202 .CS
203 Copyright (C) 1997-2015 Free Software Foundation, Inc.
204 Copyright (C) 1984 David M. Ihnat
205 .CE
206 .LP
207 The code does have pretty old origins. Further comments show that
208 the source code was reworked by David MacKenzie first and later
209 by Jim Meyering, who put it into the version control system in
210 1992. It is unclear why the years before 1997, at least from
211 1992 on, don't show up in the copyright notice.
212 .PP
213 Despite all those year numbers from the 80s, cut is a rather
214 young tool, at least in relation to early Unix. Although cut
215 is a decade older than the Linux kernel, Unix had already
216 existed for over ten years when cut appeared for the first
217 time. Most notably, cut wasn't part of Version 7 Unix, which
218 became the basis for all modern Unix systems. The more complex
219 tools sed and awk had been part of it already. Hence, the
220 question comes to mind why cut was written at all, as there
221 existed two programs that were able to cover the use cases of
222 cut. One reason for cut surely was its compactness and the
223 resulting speed, in comparison to the then bulky awk. This lean
224 shape goes well with the Unix philosophy: Do one job and do it
225 well! Cut convinced. It found its way to other Unix variants,
226 it became standardized and today it is present everywhere.
227 .PP
228 The original variant (without \f(CW-b\fP) was described by the
229 System V Interface Definition, an important formal description
230 of UNIX System V, already in 1985. In the following years, it
231 appeared in all relevant standards. POSIX.2 in 1992 specified
232 cut for the first time in its modern form (with \f(CW-b\fP).
234 .SH
235 Multi-byte support
236 .LP
237 The byte mode and thus the multi-byte support of
238 the POSIX character mode have been standardized since 1992. But
239 how about their presence in the available implementations?
240 Which versions implement POSIX correctly?
241 .PP
242 The situation falls into three groups: There are historic
243 implementations, which have only \f(CW-c\fP and \f(CW-f\fP.
244 Then there are implementations, which have \f(CW-b\fP but
245 treat it as an alias for \f(CW-c\fP only. These
246 implementations work correctly for single-byte encodings
247 (e.g. US-ASCII, Latin1) but for multi-byte encodings (e.g.
248 UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and
249 \f(CW-n\fP is ignored). Finally, there are implementations
250 that implement \f(CW-b\fP and \f(CW-c\fP in a POSIX-compliant way.
251 .PP
252 Historic two-mode implementations are the ones of
253 System III, System V and the BSD ones until the mid-90s.
254 .PP
255 Pseudo multi-byte implementations are provided by GNU and
256 modern NetBSD and OpenBSD. The level of POSIX compliance
257 that is presented there is often higher than the level of
258 compliance that is actually provided. Sometimes it takes a
259 close look to discover that \f(CW-c\fP and \f(CW-n\fP don't
260 behave as expected. Some of the implementations take the
261 easy way by simply ignoring multi-byte encodings
262 altogether; at least they state that clearly:
263 .QP
264 Since we don't support multi-byte characters, the \f(CW-c\fP and \f(CW-b\fP
265 options are equivalent, and the \f(CW-n\fP option is meaningless.
266 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
267 .LP
268 Standard-adhering implementations, ones that treat
269 multi-byte characters correctly, are the one of the modern
270 FreeBSD and the one in the Heirloom toolchest. Tim Robbins
271 reimplemented the character mode of FreeBSD cut,
272 conforming to POSIX, in summer 2004
273 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
274 The question why the other BSD systems have not
275 integrated this change is an open one. Maybe the answer can be
276 found in the above quoted statement.
277 .PP
278 How does a user find out if the cut on their own system handles
279 multi-byte characters correctly? First, one needs to check if
280 the system itself uses multi-byte characters, because otherwise
281 characters and bytes are equivalent and the question
282 is irrelevant. One can check this by looking at the locale
283 settings, but it is easier to print a typical multi-byte
284 character, for instance an Umlaut or the Euro currency
285 symbol, and check if one or more bytes are output:
286 .CS
287 $ echo ä | od -c
288 0000000 303 244 \\n
289 0000003
290 .CE
291 .LP
292 In this case, there were two bytes: octal 303 and 244. (The
293 newline character is added by echo.)
294 .PP
295 The program iconv converts text to specific encodings. This
296 is the output for Latin1 and UTF-8, for comparison:
297 .CS
298 $ echo ä | iconv -t latin1 | od -c
299 0000000 344 \\n
300 0000002
301 .sp .3
302 $ echo ä | iconv -t utf8 | od -c
303 0000000 303 244 \\n
304 0000003
305 .CE
306 .LP
307 The output (without the iconv conversion) on many European
308 systems equals one of these two.
309 .PP
310 Now the test of the cut implementation. On a UTF-8 system, a
311 POSIX-compliant implementation behaves as follows:
312 .CS
313 $ echo ä | cut -c 1 | od -c
314 0000000 303 244 \\n
315 0000003
316 .sp .3
317 $ echo ä | cut -b 1 | od -c
318 0000000 303 \\n
319 0000002
320 .sp .3
321 $ echo ä | cut -b 1 -n | od -c
322 0000000 \\n
323 0000001
324 .CE
325 .LP
326 A pseudo POSIX implementation, in contrast, behaves like the
327 middle one, for all three invocations: Only the first byte is
328 output.
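.PP
The first of these checks can be wrapped into a small helper. This is only a sketch: the function name \f(CWcut_c_mode\fP and its output strings are made up for this example, and the result is meaningful only when the shell runs in a UTF-8 locale.

```shell
# Classify the local cut by counting the bytes that "cut -c 1"
# returns for the two-byte UTF-8 character ä (octal 303 244)
# followed by a newline.
# cut_c_mode is a hypothetical helper name, not part of any cut package.
cut_c_mode() {
    n=$(printf '\303\244\n' | cut -c 1 | wc -c | tr -d '[:space:]')
    case "$n" in
    3) echo "character-aware" ;;     # both bytes of ä plus the newline
    2) echo "byte-oriented" ;;       # only the first byte plus the newline
    *) echo "unexpected: $n bytes" ;;
    esac
}

cut_c_mode
```

.LP
In a single-byte locale, a POSIX-compliant cut reports ``byte-oriented'' as well, because there each byte is a character.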
330 .SH
331 Implementations
332 .LP
333 Let's take a look at the sources of a selection of
334 implementations.
335 .PP
336 A comparison of the amount of source code is good to get a first
337 impression. Typically, it grows over time. This can be seen
338 here, in general but not in all cases. A POSIX-compliant
339 implementation of the character mode requires more code, thus
340 these implementations tend to be the larger ones.
341 .TS
342 center;
343 r r r l l l.
344 SLOC Lines Bytes Belongs to File time Category
345 _
346 116 123 2966 System III 1980-04-11 historic
347 118 125 3038 4.3BSD-UWisc 1986-11-07 historic
348 200 256 5715 4.3BSD-Reno 1990-06-25 historic
349 200 270 6545 NetBSD 1993-03-21 historic
350 218 290 6892 OpenBSD 2008-06-27 pseudo-POSIX
351 224 296 6920 FreeBSD 1994-05-27 historic
352 232 306 7500 NetBSD 2014-02-03 pseudo-POSIX
353 340 405 7423 Heirloom 2012-05-20 POSIX
354 382 586 14175 GNU coreutils 1992-11-08 pseudo-POSIX
355 391 479 10961 FreeBSD 2012-11-24 POSIX
356 588 830 23167 GNU coreutils 2015-05-01 pseudo-POSIX
357 .TE
358 .LP
359 Roughly four groups can be seen: (1) The two original
360 implementations, which are mostly identical, with about 100
361 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
362 two POSIX-compliant versions and the old GNU one, with a SLOC
363 count in the 300s. And finally (4) the modern GNU cut with
364 almost 600 SLOC.
365 .PP
366 The ratio of the number of newlines in the file
367 (\f(CWwc -l\fP) to the number of logical code lines
368 (SLOC, measured with SLOCcount) ranges from a factor of
369 1.06 for the oldest versions to a factor of 1.5 for GNU. The
370 largest influences on it are empty lines, pure comment lines
371 and the size of the license block at the beginning of the file.
372 .PP
373 Regarding the ratio of the file size (\f(CWwc -c\fP) to
374 the logical code lines, the implementations range roughly
375 between 25 and 30 bytes per statement. With only 21 bytes per
376 statement, the Heirloom implementation marks the lower end;
377 the GNU implementation sets the upper limit at nearly 40. In
378 the case of GNU, the reason is mainly their coding style, with
379 special indent rules and long identifiers. Whether one finds
380 the Heirloom implementation
381 .[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup
382 highly cryptic or exceptionally elegant is left
383 to the judgement of the reader. Especially the
384 comparison to the GNU implementation
385 .[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643
386 is impressive.
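.PP
The two ratios can be recomputed from the table with a few lines of awk. The sketch below copies three rows from the table (SLOC, lines, bytes); the function name \f(CWratios\fP and the one-word labels in the last column are assumptions chosen so that awk's default field splitting works.

```shell
# Print "label lines-per-SLOC bytes-per-SLOC" for three rows of the
# table. Columns in the here-document: SLOC, lines, bytes, label.
ratios() {
    awk '{ printf "%s %.2f %.1f\n", $4, $2 / $1, $3 / $1 }' <<'EOF'
116 123 2966 SystemIII
340 405 7423 Heirloom
588 830 23167 GNU-2015
EOF
}

ratios
# prints:
# SystemIII 1.06 25.6
# Heirloom 1.19 21.8
# GNU-2015 1.41 39.4
```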
387 .PP
388 The internal structure of the source code (in all cases it is
389 written in C) is largely similar. Besides the mandatory main
390 function, which does the command line argument processing,
391 there usually exists a function to convert the field
392 selection specification to an internal data structure.
393 Furthermore, almost all implementations have separate
394 functions for each of their operation modes. The POSIX-compliant
395 versions treat the \f(CW-b -n\fP combination as a separate
396 mode and thus implement it in a function of its own. Only the early
397 System III implementation (and its 4.3BSD-UWisc variant) does
398 everything, apart from error handling, in the main function.
399 .PP
400 Implementations of cut typically have two limiting aspects:
401 One being the maximum number of fields that can be handled,
402 the other being the maximum line length. On System III, both
403 numbers are limited to 512. 4.3BSD-Reno and the BSDs of the
404 90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or
405 \f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, NetBSD, all GNU
406 implementations and the Heirloom cut are able to handle
407 arbitrary numbers of fields and line lengths \(en the memory
408 is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed
409 maximum number of fields, but allows arbitrary line lengths.
410 The limited number of fields does not, however, appear to be
411 a practical problem, because \f(CW_POSIX2_LINE_MAX\fP is
412 guaranteed to be at least 2048 and is thus probably large enough.
414 .SH
415 Descriptions
416 .LP
417 Interesting, as well, is a comparison of the short descriptions
418 of cut, as can be found in the headlines of the man
419 pages or at the beginning of the source code files.
420 The following list is roughly sorted by time and grouped by
421 descent:
422 .TS
423 center;
424 l l.
425 CB UNIX cut out selected fields of each line of a file
426 System III cut out selected fields of each line of a file
427 System III \(dg cut and paste columns of a table (projection of a relation)
428 System V cut out selected fields of each line of a file
429 HP-UX cut out (extract) selected fields of each line of a file
430 .sp .3
431 4.3BSD-UWisc \(dg cut and paste columns of a table (projection of a relation)
432 4.3BSD-Reno select portions of each line of a file
433 NetBSD select portions of each line of a file
434 OpenBSD 4.6 select portions of each line of a file
435 FreeBSD 1.0 select portions of each line of a file
436 FreeBSD 10.0 cut out selected portions of each line of a file
437 SunOS 4.1.3 remove selected fields from each line of a file
438 SunOS 5.5.1 cut out selected fields of each line of a file
439 .sp .3
440 Heirloom Tools cut out selected fields of each line of a file
441 Heirloom Tools \(dg cut out fields of lines of files
442 .sp .3
443 GNU coreutils remove sections from each line of files
444 .sp .3
445 Minix select out columns of a file
446 .sp .3
447 Version 8 Unix rearrange columns of data
448 ``Unix Reader'' rearrange columns of text
449 .sp .3
450 POSIX cut out selected fields of each line of a file
451 .TE
452 .LP
453 (The descriptions that are marked with `\(dg' were taken from
454 source code files. The POSIX entry contains the description
455 used in the standard. The ``Unix Reader'' is a retrospective
456 document by Doug McIlroy, which lists the availability of
457 tools in the Research Unix versions
458 .[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .
459 Its description should actually match the one in Version 8
460 Unix. The change could be a transcription mistake or a correction.
461 All other descriptions originate from the various man pages.)
462 .PP
463 Over time, the POSIX description was often adopted or it
464 served as inspiration. One such example is FreeBSD
465 .[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .
466 .PP
467 It is noteworthy that the GNU coreutils in all versions
468 describe the performed action as a removal of parts of the
469 input, although the user clearly selects the parts that are
470 output. Probably the words ``cut out'' are too easily misunderstood.
471 HP-UX made them more explicit.
472 .PP
473 There are also different terms used for the thing being
474 selected. Some talk about fields (POSIX), some talk
475 about portions (BSD) and some call them columns (Research
476 Unix).
477 .PP
478 The seemingly least adequate description, the one of Version
479 8 Unix (``rearrange columns of data''), can be explained by
480 the fact that the man page covers both cut and paste, and in
481 their combination, columns can be rearranged. The use of
482 ``data'' instead of ``text'' might be a lapse, which McIlroy
483 corrected in his Unix Reader ... but, on the other hand, on
484 Unix, the two words are mostly synonymous, because all data
485 is text.
488 .SH
489 References
490 .LP
491 .nf
492 ._r