comparison cut.en.ms @ 31:106609b64dc4

minor corrections and improvements in the text
author markus schnalke <meillo@marmaro.de>
date Tue, 15 Sep 2015 17:20:20 +0200
parents 6977e2ee5dc5
children 5f78bcd34eeb
30:6977e2ee5dc5 31:106609b64dc4
88 the input. The field-delimiter character for the input as well 88 the input. The field-delimiter character for the input as well
89 as for the output (by default the tab) may be changed using 89 as for the output (by default the tab) may be changed using
90 \f(CW-d\fP. 90 \f(CW-d\fP.
91 .PP 91 .PP
92 The typical example for the use of cut's field mode is the 92 The typical example for the use of cut's field mode is the
93 selection of information from the passwd file. Here, for 93 selection of information from the password file. Here, for
94 instance, the usernames and their uids: 94 instance, the usernames and their uids:
95 .CS 95 .CS
96 $ cut -d: -f1,3 /etc/passwd 96 $ cut -d: -f1,3 /etc/passwd
97 root:0 97 root:0
98 bin:1 98 bin:1
103 .LP 103 .LP
104 (The values to the command line switches may be appended directly 104 (The values to the command line switches may be appended directly
105 to them or separated by whitespace.) 105 to them or separated by whitespace.)
106 .PP 106 .PP
107 The field mode is suited for simple tabular data, like the 107 The field mode is suited for simple tabular data, like the
108 passwd file. Beyond that, it soon reaches its limits. The typical 108 password file. Beyond that, it soon reaches its limits. The typical
109 case of whitespace-separated fields, in particular, is covered 109 case of whitespace-separated fields, in particular, is covered
110 poorly by it. Cut's delimiter is exactly one character, 110 poorly by it. Cut's delimiter is exactly one character,
111 therefore one can not split at both space and tab characters. 111 therefore one can not split at both space and tab characters.
112 Furthermore, multiple adjacent delimiter characters lead to 112 Furthermore, multiple adjacent delimiter characters lead to
113 empty fields. This is not the expected behavior for 113 empty fields. This is not the expected behavior for
203 .CS 203 .CS
204 Copyright (C) 1997-2015 Free Software Foundation, Inc. 204 Copyright (C) 1997-2015 Free Software Foundation, Inc.
205 Copyright (C) 1984 David M. Ihnat 205 Copyright (C) 1984 David M. Ihnat
206 .CE 206 .CE
207 .LP 207 .LP
208 The code does have old origins. Further comments show that 208 This code does have old origins. Further comments show that
209 the source code was reworked by David MacKenzie first and later 209 the source code was reworked by David MacKenzie first and later
210 by Jim Meyering, who put it into the version control system in 210 by Jim Meyering, who put it into the version control system in
211 1992. It is unclear why the years until 1997, at least from 211 1992. It is unclear why the years until 1997, at least from
212 1992 onwards, don't show up in the copyright notice. 212 1992 onwards, don't show up in the copyright notice.
213 .PP 213 .PP
214 Despite all those year numbers from the 80s, cut is a rather 214 Despite all those year numbers from the 80s, cut is a rather
215 young tool, at least in relation to the early Unix. Despite 215 young tool, at least in relation to the early Unix. Despite
216 being a decade older than Linux (the kernel), Unix was present 216 being a decade older than Linux (the kernel), Unix was present
217 for over ten years by the time cut appeared for the first 217 for over ten years already by the time cut appeared for the first
218 time. Most notably, cut wasn't part of Version 7 Unix, which 218 time. Most notably, cut wasn't part of Version 7 Unix, which
219 became the basis for all modern Unix systems. The more complex 219 became the basis for all modern Unix systems. The more complex
220 tools sed and awk were part of it already. Hence, the 220 tools sed and awk were part of it already. Hence, the
221 question comes to mind why cut was written at all, as two 221 question comes to mind why cut was written at all, as two
222 programs already existed that were able to cover the use cases of 222 programs already existed that were able to cover its use
223 cut. One reason for cut surely was its compactness and the 223 cases. One reason for cut surely was its compactness and the
224 resulting speed, in comparison to the then-bulky awk. This lean 224 resulting speed, in comparison to the then-bulky awk. This lean
225 shape goes well with the Unix philosophy: Do one job and do it 225 shape goes well with the Unix philosophy: Do one job and do it
226 well! Cut was sufficiently convincing. It found its way to 226 well! Cut was sufficiently convincing. It found its way to
227 other Unix variants, it became standardized, and today it is 227 other Unix variants, it became standardized, and today it is
228 present everywhere. 228 present everywhere.
251 \f(CW-n\fP is ignored). Finally, there are implementations 251 \f(CW-n\fP is ignored). Finally, there are implementations
252 that implement \f(CW-c\fP and \f(CW-b\fP in a POSIX-compliant 252 that implement \f(CW-c\fP and \f(CW-b\fP in a POSIX-compliant
253 way. 253 way.
254 .PP 254 .PP
255 Historic two-mode implementations are the ones of 255 Historic two-mode implementations are the ones of
256 System III, System V, and the BSD ones from the beginning 256 System III, System V, and the BSD ones until the mid-90s.
257 until the mid-90s.
258 .PP 257 .PP
259 Pseudo multi-byte implementations are provided by GNU, 258 Pseudo multi-byte implementations are provided by GNU,
260 modern NetBSD, and modern OpenBSD. The level of POSIX compliance 259 modern NetBSD, and modern OpenBSD. The level of POSIX compliance
261 that is presented there is often higher than the level of 260 that is presented there is often higher than the level of
262 compliance that is actually provided. Sometimes it takes a 261 compliance that is actually provided. Sometimes it takes a
268 Since we don't support multi-byte characters, the \f(CW-c\fP 267 Since we don't support multi-byte characters, the \f(CW-c\fP
269 and \f(CW-b\fP options are equivalent, and the \f(CW-n\fP 268 and \f(CW-b\fP options are equivalent, and the \f(CW-n\fP
270 option is meaningless. 269 option is meaningless.
271 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup 270 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
272 .LP 271 .LP
273 Standard-adhering implementations, ones that treat 272 Standard-adhering implementations, i.e. ones that treat
274 multi-byte characters correctly, are the one of the modern 273 multi-byte characters correctly, are those of the modern
275 FreeBSD and the one in the Heirloom toolchest. Tim Robbins 274 FreeBSD and the Heirloom toolchest. Tim Robbins
276 reimplemented the character mode of FreeBSD cut, 275 reimplemented the character mode of FreeBSD cut,
277 conforming to POSIX, in the summer of 2004 276 conforming to POSIX, in the summer of 2004
278 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 . 277 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
279 The question why the other BSD systems have not 278 The question why the other BSD systems have not
280 integrated this change is an open one. Maybe the answer can be 279 integrated this change is an open one. Maybe the answer can be
281 found in the above quoted statement. 280 found in the above quoted statement.
282 .PP 281 .PP
283 How does a user find out if the cut on their own system handles 282 How do users find out if the cut on their own system handles
284 multi-byte characters correctly? First, one needs to check if 283 multi-byte characters correctly? First, one needs to check if
285 the system itself uses multi-byte characters, because otherwise 284 the system itself uses multi-byte characters, because otherwise
286 characters and bytes are equivalent and the question 285 characters and bytes are equivalent and the question
287 is irrelevant. One can check this by looking at the locale 286 is irrelevant. One can check this by looking at the locale
288 settings, but it is easier to print a typical multi-byte 287 settings, but it is easier to print a typical multi-byte
345 implementation of the character mode requires more code, thus 344 implementation of the character mode requires more code, thus
346 these implementations tend to be the larger ones. 345 these implementations tend to be the larger ones.
347 .TS 346 .TS
348 center; 347 center;
349 r r r l l l. 348 r r r l l l.
350 SLOC Lines Bytes Belongs to File tyime Category 349 SLOC Lines Bytes Belongs to File time Category
351 _ 350 _
352 116 123 2966 System III 1980-04-11 historic 351 116 123 2966 System III 1980-04-11 historic
353 118 125 3038 4.3BSD-UWisc 1986-11-07 historic 352 118 125 3038 4.3BSD-UWisc 1986-11-07 historic
354 200 256 5715 4.3BSD-Reno 1990-06-25 historic 353 200 256 5715 4.3BSD-Reno 1990-06-25 historic
355 200 270 6545 NetBSD 1993-03-21 historic 354 200 270 6545 NetBSD 1993-03-21 historic
360 382 586 14175 GNU coreutils 1992-11-08 pseudo-POSIX 359 382 586 14175 GNU coreutils 1992-11-08 pseudo-POSIX
361 391 479 10961 FreeBSD 2012-11-24 POSIX 360 391 479 10961 FreeBSD 2012-11-24 POSIX
362 588 830 23167 GNU coreutils 2015-05-01 pseudo-POSIX 361 588 830 23167 GNU coreutils 2015-05-01 pseudo-POSIX
363 .TE 362 .TE
364 .LP 363 .LP
365 Roughly four groups can be seen: (1) The two original 364 There are four rough groups: (1) The two original
366 implementations, which are mostly identical, with about 100 365 implementations, which are mostly identical, with about 100
367 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The 366 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
368 two POSIX-compliant versions and the old GNU one, with a SLOC 367 two POSIX-compliant versions and the old GNU one, with a SLOC
369 count in the 300s. And finally, (4) the modern GNU cut with 368 count in the 300s. And finally, (4) the modern GNU cut with
370 almost 600 SLOC. 369 almost 600 SLOC.
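
The field-mode behavior described in the first hunk (option values attached to a switch or separated from it, and adjacent delimiters producing empty fields) can be tried out with a short sketch like the following. The sample input lines and the variable names are made up for illustration and do not come from the text; only the documented options \f(CW-d\fP and \f(CW-f\fP are used.

```shell
# Sample input is hypothetical; it is not taken from the text.
line='alice  42'    # two spaces between the fields

# Option values may be attached to the switch or separated by whitespace.
attached=$(printf 'root:x:0:0\n' | cut -d: -f1,3)
separated=$(printf 'root:x:0:0\n' | cut -d : -f 1,3)

# Each single space counts as a delimiter, so field 2 is the empty
# field between the two spaces and the number ends up in field 3.
field2=$(printf '%s\n' "$line" | cut -d' ' -f2)
field3=$(printf '%s\n' "$line" | cut -d' ' -f3)

printf 'attached=%s separated=%s\n' "$attached" "$separated"
printf 'field2=[%s] field3=[%s]\n' "$field2" "$field3"
```

Both spellings of the switches select the same fields, while the doubled space shifts the number into the third field. That is the reason whitespace-aligned input is usually handled with a different tool.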