comparison cut.en.ms @ 27:5cefcfc72d42
Added first version of the translation to English
author | markus schnalke <meillo@marmaro.de> |
---|---|
date | Tue, 04 Aug 2015 21:04:10 +0200 |
parents | |
children | 0d7329867dd1 |
26:3b4e53e04958 | 27:5cefcfc72d42 |
---|---|
1 .so macros | |
2 .lc_ctype en_US.utf8 | |
3 .pl -4v | |
4 | |
5 .TL | |
6 Cut out selected fields of each line of a file | |
7 .AU | |
8 markus schnalke <meillo@marmaro.de> | |
9 .. | |
10 .FS | |
11 2015-05. | |
12 This text is in the public domain (CC0). | |
13 It is available online: | |
14 .I http://marmaro.de/docs/ | |
15 .FE | |
16 | |
17 .LP | |
18 Cut is a classic program in the Unix toolchest. | |
19 It is present in most tutorials on shell programming, because it | |
20 is such a nice and useful tool with good explanatory value. | |
21 This text shall take a look behind its surface. | |
22 .SH | |
23 Usage | |
24 .LP | |
25 Initially, cut had two operation modes, which were later joined by a | |
26 third one. Cut may cut specified characters out of the | |
27 input lines or it may cut out specified fields, which are defined | |
28 by a delimiting character. | |
29 .PP | |
30 The character mode is well suited to slice fixed-width input | |
31 formats into parts. One might, for instance, extract the access | |
32 rights from the output of \f(CWls -l\fP, here the rights of the | |
33 file's owner: | |
34 .CS | |
35 $ ls -l foo | |
36 -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo | |
37 .sp .3 | |
38 $ ls -l foo | cut -c 2-4 | |
39 rw- | |
40 .CE | |
41 .LP | |
42 Or the write permission for the owner, the group and the | |
43 world: | |
44 .CS | |
45 $ ls -l foo | cut -c 3,6,9 | |
46 ww- | |
47 .CE | |
48 .LP | |
49 Cut can also be used to shorten strings: | |
50 .CS | |
51 $ long=12345678901234567890 | |
52 .sp .3 | |
53 $ echo "$long" | cut -c -10 | |
54 1234567890 | |
55 .CE | |
56 .LP | |
57 This command outputs no more than the first 10 characters of | |
58 \f(CW$long\fP. (Alternatively, one could use \f(CWprintf | |
59 "%.10s\\n" "$long"\fP for this job.) | |
60 .PP | |
61 However, if the point is not displaying characters but storing | |
62 them, then \f(CW-c\fP is only partly suited. In former times, | |
63 when US-ASCII was the omnipresent character encoding, each | |
64 character was stored in exactly one byte. Therefore, \f(CWcut | |
65 -c\fP selected output characters and bytes alike. With | |
66 the rise of multi-byte encodings (like UTF-8), this assumption | |
67 became obsolete. Consequently, a byte mode (option \f(CW-b\fP) | |
68 was added to cut with POSIX.2-1992. To select up to the first | |
69 500 bytes of each line (and ignore the rest), one can use: | |
70 .CS | |
71 $ cut -b -500 | |
72 .CE | |
73 .LP | |
74 The remainder can be caught with \f(CWcut -b 501-\fP. This | |
75 possibility is important for POSIX, because it allows creating | |
76 text files with limited line length | |
77 .[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 . | |
78 .PP | |
79 Although the byte mode was newly introduced, it was meant to | |
80 behave exactly like the old character mode. The character mode, | |
81 however, had to be implemented differently. As a consequence, | |
82 the problem was not supporting the byte mode but supporting the | |
83 new character mode correctly. | |
84 .PP | |
85 Besides the character and byte modes, cut has the field mode, | |
86 which is activated by \f(CW-f\fP. It selects fields from the | |
87 input. The delimiting character (by default, the tab) may be | |
88 changed using \f(CW-d\fP. It applies to the input as well as to | |
89 the output. | |
90 .PP | |
91 The typical example for the use of cut's field mode is the | |
92 selection of information from the passwd file. Here, for | |
93 instance, the username and its uid: | |
94 .CS | |
95 $ cut -d: -f1,3 /etc/passwd | |
96 root:0 | |
97 bin:1 | |
98 daemon:2 | |
99 mail:8 | |
100 ... | |
101 .CE | |
102 .LP | |
103 (The values to the command line switches may be appended directly | |
104 to them or separated by whitespace.) | |
105 .PP | |
106 The field mode is suited for simple tabular data, like the | |
107 passwd file. Beyond that, it soon reaches its limits. In particular, | |
108 the typical case of whitespace-separated fields is covered poorly | |
109 by it. Cut's delimiter is exactly one character, | |
110 therefore one cannot split at both space and tab characters. | |
111 Furthermore, multiple adjacent delimiter characters lead to | |
112 empty fields. This is not the expected behavior for | |
113 the processing of whitespace-separated fields. Some | |
114 implementations, e.g. the one in FreeBSD, have extensions that | |
115 handle this case in the expected way. Otherwise, i.e. | |
116 if one wants to stay portable, awk comes to the rescue. | |
117 .PP | |
118 Awk provides another function that cut lacks: changing the order | |
119 of the fields in the output. For cut, the order of the field | |
120 selection specification is irrelevant; it doesn't even matter if | |
121 fields are given multiple times. Thus, the invocation | |
122 \f(CWcut -c 5-8,1,4-6\fP outputs the characters number | |
123 1, 4, 5, 6, 7 and 8 in exactly this order. The | |
124 selection works as in mathematical set theory: Each | |
125 specified field is part of the solution set. The fields in the | |
126 solution set are always in the same order as in the input. In | |
127 the words of the man page in Version 8 Unix: | |
128 ``In data base parlance, it projects a relation.'' | |
129 .[[ http://man.cat-v.org/unix_8th/1/cut | |
130 This means that cut applies the database operation \fIprojection\fP | |
131 to the text input. Wikipedia explains it in the following way: | |
132 ``In practical terms, it can be roughly thought of as picking a | |
133 sub-set of all available columns.'' | |
134 .[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra) | |
135 | |
136 .SH | |
137 Historical Background | |
138 .LP | |
139 Cut came to public life in 1982 with the release of UNIX System | |
140 III. Browsing through the sources of System III, one finds cut.c | |
141 with the timestamp 1980-04-11 | |
142 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd . | |
143 This is the oldest implementation of the program that I was able to | |
144 discover. However, the SCCS-ID in the source code speaks of | |
145 version 1.5. According to Doug McIlroy | |
146 .[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html , | |
147 the earlier history likely lies in PWB/UNIX, which was the | |
148 basis for System III. In the available sources of PWB 1.0 (1977) | |
149 .[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ , | |
150 no cut is present. Of PWB 2.0, no sources or useful documentation | |
151 seem to be available. PWB 3.0 was later renamed to System III | |
152 for marketing purposes, hence it is identical to it. A side branch | |
153 of PWB was CB UNIX, which was used only internally at Bell | |
154 Labs. The manual of CB UNIX Edition 2.1 of November 1979 | |
155 contains the earliest mention of cut that my research brought | |
156 to light: a man page for it | |
157 .[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf . | |
158 .PP | |
159 Now a look at BSD: There, my earliest discovery is a cut.c with | |
160 the file modification date of 1986-11-07 | |
161 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut | |
162 as part of the special version 4.3BSD-UWisc | |
163 .[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix , | |
164 which was released in January 1987. | |
165 This implementation is mostly identical to the one in System | |
166 III. The better known 4.3BSD-Tahoe (1988) does not contain cut. | |
167 The following 4.3BSD-Reno (1990) does include cut. It is a freshly | |
168 written one by Adam S. Moskowitz and Marciano Pitargue, which was | |
169 included in BSD in 1989 | |
170 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut . | |
171 Its man page | |
172 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1 | |
173 already mentions the expected compliance to POSIX.2. | |
174 One should note that POSIX.2 was first published in | |
175 September 1992, about two years after the man page and the | |
176 program were written. Hence, the program must have been | |
177 implemented based on a draft version of the standard. A look into | |
178 the code confirms the assumption. The function to parse the field | |
179 selection includes the following comment: | |
180 .QP | |
181 This parser is less restrictive than the Draft 9 POSIX spec. | |
182 POSIX doesn't allow lists that aren't in increasing order or | |
183 overlapping lists. | |
184 .LP | |
185 Draft 11.2 of POSIX (1991-09) requires this flexibility already: | |
186 .QP | |
187 The elements in list can be repeated, can overlap, and can | |
188 be specified in any order. | |
189 .LP | |
190 The same draft additionally includes all three operation modes, | |
191 whereas this early BSD cut only implemented the original two. | |
192 Draft 9 might not have included the byte mode. Without access to | |
193 Draft 9 or 10, it wasn't possible to verify this guess. | |
194 .PP | |
195 The version numbers and change dates of the older BSD | |
196 implementations are manifested in the SCCS-IDs, which the | |
197 version control system of that time inserted. For instance | |
198 in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''. | |
199 .PP | |
200 The cut implementation of the GNU coreutils contains the | |
201 following copyright notice: | |
202 .CS | |
203 Copyright (C) 1997-2015 Free Software Foundation, Inc. | |
204 Copyright (C) 1984 David M. Ihnat | |
205 .CE | |
206 .LP | |
207 The code does indeed have fairly old origins. Further comments show that | |
208 the source code was reworked by David MacKenzie first and later | |
209 by Jim Meyering, who put it into the version control system in | |
210 1992. It is unclear why the years before 1997, at least from | |
211 1992 on, don't show up in the copyright notice. | |
212 .PP | |
213 Despite all those year numbers from the 80s, cut is a rather | |
214 young tool, at least in relation to early Unix. Although cut | |
215 is a decade older than the Linux kernel, Unix had already | |
216 existed for over ten years when cut appeared for the first | |
217 time. Most notably, cut wasn't part of Version 7 Unix, which | |
218 became the basis for all modern Unix systems. The more complex | |
219 tools sed and awk had been part of it already. Hence, the | |
220 question comes to mind why cut was written at all, as there | |
221 already existed two programs that were able to cover the use cases of | |
222 cut. One reason for cut surely was its compactness and the | |
223 resulting speed, in comparison to the then bulky awk. This lean | |
224 shape goes well with the Unix philosophy: Do one job and do it | |
225 well! Cut convinced. It found its way to other Unix variants, | |
226 it became standardized and today it is present everywhere. | |
227 .PP | |
228 The original variant (without \f(CW-b\fP) was described by the | |
229 System V Interface Definition, an important formal description | |
230 of UNIX System V, as early as 1985. In the following years, it | |
231 appeared in all relevant standards. POSIX.2 in 1992 specified | |
232 cut for the first time in its modern form (with \f(CW-b\fP). | |
233 | |
234 .SH | |
235 Multi-byte support | |
236 .LP | |
237 The byte mode and thus the multi-byte support of | |
238 the POSIX character mode have been standardized since 1992. But | |
239 how about their presence in the available implementations? | |
240 Which versions implement POSIX correctly? | |
241 .PP | |
242 The situation falls into three parts: There are historic | |
243 implementations that have only \f(CW-c\fP and \f(CW-f\fP. | |
244 Then there are implementations that have \f(CW-b\fP but | |
245 treat it merely as an alias for \f(CW-c\fP. These | |
246 implementations work correctly for single-byte encodings | |
247 (e.g. US-ASCII, Latin1) but for multi-byte encodings (e.g. | |
248 UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and | |
249 \f(CW-n\fP is ignored). Finally, there are implementations | |
250 that implement \f(CW-b\fP and \f(CW-c\fP in a POSIX-compliant way. | |
251 .PP | |
252 Historic two-mode implementations are those of | |
253 System III, System V and the BSDs until the mid-90s. | |
254 .PP | |
255 Pseudo multi-byte implementations are provided by GNU and | |
256 modern NetBSD and OpenBSD. The level of POSIX compliance | |
257 that they claim is often higher than the level of | |
258 compliance that they actually provide. Sometimes it takes a | |
259 close look to discover that \f(CW-c\fP and \f(CW-n\fP don't | |
260 behave as expected. Some of the implementations take the | |
261 easy way by simply ignoring multi-byte | |
262 encodings; at least they say so clearly: | |
263 .QP | |
264 Since we don't support multi-byte characters, the \f(CW-c\fP and \f(CW-b\fP | |
265 options are equivalent, and the \f(CW-n\fP option is meaningless. | |
266 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup | |
267 .LP | |
268 Standard-adhering implementations, ones that treat | |
269 multi-byte characters correctly, are those of modern | |
270 FreeBSD and of the Heirloom toolchest. Tim Robbins | |
271 reimplemented the character mode of FreeBSD cut, | |
272 conforming to POSIX, in summer 2004 | |
273 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 . | |
274 Why the other BSD systems have not | |
275 integrated this change remains an open question. Maybe the answer can be | |
276 found in the above quoted statement. | |
277 .PP | |
278 How does a user find out whether the cut on their own system handles | |
279 multi-byte characters correctly? First, one needs to check if | |
280 the system itself uses multi-byte characters, because otherwise | |
281 characters and bytes are equivalent and the question | |
282 is irrelevant. One can check this by looking at the locale | |
283 settings, but it is easier to print a typical multi-byte | |
284 character, for instance an umlaut or the Euro currency | |
285 symbol, and check if one or more bytes are output: | |
286 .CS | |
287 $ echo ä | od -c | |
288 0000000 303 244 \\n | |
289 0000003 | |
290 .CE | |
291 .LP | |
292 In this case, two bytes were output: octal 303 and 244. (The | |
293 newline character is added by echo.) | |
294 .PP | |
295 The program iconv converts text to specific encodings. This | |
296 is the output for Latin1 and UTF-8, for comparison: | |
297 .CS | |
298 $ echo ä | iconv -t latin1 | od -c | |
299 0000000 344 \\n | |
300 0000002 | |
301 .sp .3 | |
302 $ echo ä | iconv -t utf8 | od -c | |
303 0000000 303 244 \\n | |
304 0000003 | |
305 .CE | |
306 .LP | |
307 The output (without the iconv conversion) on many European | |
308 systems equals one of these two. | |
309 .PP | |
310 Now the test of the cut implementation. On a UTF-8 system, a | |
311 POSIX-compliant implementation behaves as follows: | |
312 .CS | |
313 $ echo ä | cut -c 1 | od -c | |
314 0000000 303 244 \\n | |
315 0000003 | |
316 .sp .3 | |
317 $ echo ä | cut -b 1 | od -c | |
318 0000000 303 \\n | |
319 0000002 | |
320 .sp .3 | |
321 $ echo ä | cut -b 1 -n | od -c | |
322 0000000 \\n | |
323 0000001 | |
324 .CE | |
325 .LP | |
326 A pseudo-POSIX implementation, in contrast, behaves like the | |
327 middle one, for all three invocations: Only the first byte is | |
328 output. | |
329 | |
330 .SH | |
331 Implementations | |
332 .LP | |
333 Let's take a look at the sources of a selection of | |
334 implementations. | |
335 .PP | |
336 A comparison of the amount of source code is good to get a first | |
337 impression. Typically, it grows over time. This holds | |
338 here in general, though not in every case. A POSIX-compliant | |
339 implementation of the character mode requires more code, thus | |
340 these implementations tend to be the larger ones. | |
341 .TS | |
342 center; | |
343 r r r l l l. | |
344 SLOC Lines Bytes Belongs to File time Category | |
345 _ | |
346 116 123 2966 System III 1980-04-11 historic | |
347 118 125 3038 4.3BSD-UWisc 1986-11-07 historic | |
348 200 256 5715 4.3BSD-Reno 1990-06-25 historic | |
349 200 270 6545 NetBSD 1993-03-21 historic | |
350 218 290 6892 OpenBSD 2008-06-27 pseudo-POSIX | |
351 224 296 6920 FreeBSD 1994-05-27 historic | |
352 232 306 7500 NetBSD 2014-02-03 pseudo-POSIX | |
353 340 405 7423 Heirloom 2012-05-20 POSIX | |
354 382 586 14175 GNU coreutils 1992-11-08 pseudo-POSIX | |
355 391 479 10961 FreeBSD 2012-11-24 POSIX | |
356 588 830 23167 GNU coreutils 2015-05-01 pseudo-POSIX | |
357 .TE | |
358 .LP | |
359 Roughly four groups can be seen: (1) The two original | |
360 implementations, which are mostly identical, with about 100 | |
361 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The | |
362 two POSIX-compliant versions and the old GNU one, with a SLOC | |
363 count in the 300s. And finally (4) the modern GNU cut with | |
364 almost 600 SLOC. | |
365 .PP | |
366 The ratio between the number of logical code | |
367 lines (SLOC, measured with SLOCcount) and the number of | |
368 newlines in the file (\f(CWwc -l\fP) ranges from a factor of | |
369 1.06 for the oldest versions to a factor of 1.5 for GNU. The | |
370 largest influences on it are empty lines, pure comment lines | |
371 and the size of the license block at the beginning of the file. | |
372 .PP | |
373 Regarding the relation between logical code lines and the | |
374 file size (\f(CWwc -c\fP), the implementations range between | |
375 25 and 30 bytes per statement. With only 21 bytes per | |
376 statement, the Heirloom implementation marks the lower end; | |
377 the GNU implementation sets the upper limit at nearly 40. In | |
378 the case of GNU, the reason is mainly their coding style, with | |
379 special indent rules and long identifiers. Whether one finds | |
380 the Heirloom implementation | |
381 .[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup | |
382 highly cryptic or exceptionally elegant is left | |
383 to the judgement of the reader. Especially the | |
384 comparison to the GNU implementation | |
385 .[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643 | |
386 is impressive. | |
387 .PP | |
388 The internal structure of the source code (in all cases it is | |
389 written in C) is largely similar. Besides the mandatory main | |
390 function, which does the command line argument processing, | |
391 there usually exists a function to convert the field | |
392 selection specification to an internal data structure. | |
393 Furthermore, almost all implementations have separate | |
394 functions for each of their operation modes. The POSIX-compliant | |
395 versions treat the \f(CW-b -n\fP combination as a separate | |
396 mode and thus implement it in a function of its own. Only the early | |
397 System III implementation (and its 4.3BSD-UWisc variant) does | |
398 everything, apart from error handling, in the main function. | |
399 .PP | |
400 Implementations of cut typically have two limiting aspects: | |
401 One being the maximum number of fields that can be handled, | |
402 the other being the maximum line length. On System III, both | |
403 numbers are limited to 512. 4.3BSD-Reno and the BSDs of the | |
404 90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or | |
405 \f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, NetBSD, all GNU | |
406 implementations and the Heirloom cut are able to handle | |
407 arbitrary numbers of fields and line lengths \(en the memory | |
408 is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed | |
409 maximum number of fields, but allows arbitrary line lengths. | |
410 The limited number of fields does not, however, appear to be | |
411 a practical problem, because \f(CW_POSIX2_LINE_MAX\fP is | |
412 guaranteed to be at least 2048 and is thus probably large enough. | |
413 | |
414 .SH | |
415 Descriptions | |
416 .LP | |
417 Also interesting is a comparison of the short descriptions | |
418 of cut, as can be found in the headlines of the man | |
419 pages or at the beginning of the source code files. | |
420 The following list is roughly sorted by time and grouped by | |
421 descent: | |
422 .TS | |
423 center; | |
424 l l. | |
425 CB UNIX cut out selected fields of each line of a file | |
426 System III cut out selected fields of each line of a file | |
427 System III \(dg cut and paste columns of a table (projection of a relation) | |
428 System V cut out selected fields of each line of a file | |
429 HP-UX cut out (extract) selected fields of each line of a file | |
430 .sp .3 | |
431 4.3BSD-UWisc \(dg cut and paste columns of a table (projection of a relation) | |
432 4.3BSD-Reno select portions of each line of a file | |
433 NetBSD select portions of each line of a file | |
434 OpenBSD 4.6 select portions of each line of a file | |
435 FreeBSD 1.0 select portions of each line of a file | |
436 FreeBSD 10.0 cut out selected portions of each line of a file | |
437 SunOS 4.1.3 remove selected fields from each line of a file | |
438 SunOS 5.5.1 cut out selected fields of each line of a file | |
439 .sp .3 | |
440 Heirloom Tools cut out selected fields of each line of a file | |
441 Heirloom Tools \(dg cut out fields of lines of files | |
442 .sp .3 | |
443 GNU coreutils remove sections from each line of files | |
444 .sp .3 | |
445 Minix select out columns of a file | |
446 .sp .3 | |
447 Version 8 Unix rearrange columns of data | |
448 ``Unix Reader'' rearrange columns of text | |
449 .sp .3 | |
450 POSIX cut out selected fields of each line of a file | |
451 .TE | |
452 .LP | |
453 (The descriptions that are marked with `\(dg' were taken from | |
454 source code files. The POSIX entry contains the description | |
455 used in the standard. The ``Unix Reader'' is a retrospective | |
456 document by Doug McIlroy, which lists the availability of | |
457 tools in the Research Unix versions | |
458 .[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf . | |
459 Its description should actually match the one in Version 8 | |
460 Unix. The change could be a transfer mistake or a correction. | |
461 All other descriptions originate from the various man pages.) | |
462 .PP | |
463 Over time, the POSIX description was often adopted or it | |
464 served as inspiration. One such example is FreeBSD | |
465 .[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 . | |
466 .PP | |
467 It is noteworthy that the GNU coreutils in all versions | |
468 describe the performed action as a removal of parts of the | |
469 input, although the user clearly selects the parts that are | |
470 output. Probably the words ``cut out'' are too easily misunderstood. | |
471 HP-UX made them more concrete. | |
472 .PP | |
473 There are also different terms used for the thing being | |
474 selected. Some talk about fields (POSIX), some talk | |
475 about portions (BSD) and some call them columns (Research | |
476 Unix). | |
477 .PP | |
478 The seemingly least adequate description, the one of Version | |
479 8 Unix (``rearrange columns of data''), can be explained by | |
480 the fact that the man page covers both cut and paste, and in | |
481 their combination, columns can be rearranged. The use of | |
482 ``data'' instead of ``text'' might be a lapse, which McIlroy | |
483 corrected in his Unix Reader ... but, on the other hand, on | |
484 Unix, the two words are mostly synonymous, because all data | |
485 is text. | |
486 | |
487 | |
488 .SH | |
489 References | |
490 .LP | |
491 .nf | |
492 ._r | |
493 |