docs/cut

annotate cut.en.ms @ 41:e2961496d097

Add script to generate the PDFs
author markus schnalke <meillo@marmaro.de>
date Tue, 10 Nov 2015 21:13:22 +0100
parents 7608a7416bc0 ec76f8926598
children
rev   line source
meillo@27 1 .so macros
meillo@27 2 .lc_ctype en_US.utf8
meillo@34 3 .pl -3v
meillo@27 4
meillo@27 5 .TL
meillo@27 6 Cut out selected fields of each line of a file
meillo@27 7 .AU
meillo@27 8 markus schnalke <meillo@marmaro.de>
meillo@27 9 ..
meillo@27 10 .FS
meillo@27 11 2015-05.
meillo@34 12 This text is part of the public domain (CC0).
meillo@27 13 It is available online:
meillo@27 14 .I http://marmaro.de/docs/
meillo@27 15 .FE
meillo@27 16
meillo@27 17 .LP
meillo@27 18 Cut is a classic program in the Unix toolchest.
meillo@27 19 It is present in most tutorials on shell programming, because it
meillo@28 20 is such a nice and useful tool with good explanatory value.
meillo@28 21 This text shall take a look underneath its surface.
meillo@27 22 .SH
meillo@27 23 Usage
meillo@27 24 .LP
meillo@28 25 Initially, cut had two operation modes, which were later amended
meillo@28 26 by a third: The cut program may cut specified characters or bytes
meillo@28 27 out of the input lines or it may cut out specified fields, which
meillo@28 28 are defined by a delimiting character.
meillo@27 29 .PP
meillo@27 30 The character mode is well suited to slice fixed-width input
meillo@27 31 formats into parts. One might, for instance, extract the access
meillo@28 32 rights from the output of \f(CWls -l\fP, as shown here with the
meillo@28 33 rights of a file's owner:
meillo@27 34 .CS
meillo@27 35 $ ls -l foo
meillo@27 36 -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo
meillo@27 37 .sp .3
meillo@27 38 $ ls -l foo | cut -c 2-4
meillo@27 39 rw-
meillo@27 40 .CE
meillo@27 41 .LP
meillo@28 42 Or the write permission for the owner, the group, and the
meillo@27 43 world:
meillo@27 44 .CS
meillo@27 45 $ ls -l foo | cut -c 3,6,9
meillo@27 46 ww-
meillo@27 47 .CE
meillo@27 48 .LP
meillo@27 49 Cut can also be used to shorten strings:
meillo@27 50 .CS
meillo@27 51 $ long=12345678901234567890
meillo@27 52 .sp .3
meillo@27 53 $ echo "$long" | cut -c -10
meillo@27 54 1234567890
meillo@27 55 .CE
meillo@27 56 .LP
meillo@27 57 This command outputs no more than the first 10 characters of
meillo@27 58 \f(CW$long\fP. (Alternatively, one could use \f(CWprintf
meillo@28 59 "%.10s\\n" "$long"\fP for this task.)
meillo@27 60 .PP
meillo@28 61 However, if it's not about displaying characters, but rather about
meillo@28 62 storing them, then \f(CW-c\fP is only partly suited. In former times,
meillo@28 63 when US-ASCII was the omnipresent character encoding, each
meillo@28 64 character was stored as exactly one byte. Therefore, \f(CWcut
meillo@28 65 -c\fP selected both output characters and bytes equally. With
meillo@27 66 the rise of multi-byte encodings (like UTF-8), this assumption
meillo@27 67 became obsolete. Consequently, a byte mode (option \f(CW-b\fP)
meillo@28 68 was added to cut, with POSIX.2-1992. To select up to 500 bytes
meillo@28 69 from the beginning of each line (and ignore the rest), one can use:
meillo@27 70 .CS
meillo@27 71 $ cut -b -500
meillo@27 72 .CE
meillo@27 73 .LP
meillo@27 74 The remainder can be caught with \f(CWcut -b 501-\fP. This
meillo@30 75 use of cut is important for POSIX, because it provides a
meillo@33 76 transformation of text files with arbitrary line lengths to text
meillo@28 77 files with limited line length
meillo@27 78 .[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .
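.LP
As a small illustration of this splitting, the following sketch uses a short ASCII line and a limit of 3 bytes instead of 500. The sample line and the variable names are made up for the example. Joining the two parts again yields the original line.

```shell
# Hypothetical sample line; in ASCII every character is one byte.
line='abcdefgh'

# The first 3 bytes of the line ...
head_part=$(printf '%s\n' "$line" | cut -b -3)
# ... and everything from byte 4 onwards.
tail_part=$(printf '%s\n' "$line" | cut -b 4-)

printf '%s|%s\n' "$head_part" "$tail_part"
```

This prints \f(CWabc|defgh\fP.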
meillo@27 79 .PP
meillo@28 80 The new byte mode essentially provided the same
meillo@28 81 functionality as the old character mode. The character mode,
meillo@28 82 however, required a new, different implementation. In consequence,
meillo@28 83 the problem was not the support of the byte mode, but rather the
meillo@28 84 correct support of the new character mode.
meillo@27 85 .PP
meillo@28 86 Besides the character and byte modes, cut also offers a field
meillo@28 87 mode, which is activated by \f(CW-f\fP. It selects fields from
meillo@28 88 the input. The field-delimiter character for the input as well
meillo@28 89 as for the output (by default the tab) may be changed using
meillo@28 90 \f(CW-d\fP.
meillo@27 91 .PP
meillo@27 92 The typical example for the use of cut's field mode is the
meillo@31 93 selection of information from the password file. Here, for
meillo@28 94 instance, the usernames and their uids:
meillo@27 95 .CS
meillo@27 96 $ cut -d: -f1,3 /etc/passwd
meillo@27 97 root:0
meillo@27 98 bin:1
meillo@27 99 daemon:2
meillo@27 100 mail:8
meillo@27 101 ...
meillo@27 102 .CE
meillo@27 103 .LP
meillo@27 104 (The values to the command line switches may be appended directly
meillo@34 105 to them or separated by white\%space.)
meillo@27 106 .PP
meillo@33 107 The field mode is suited for simple tabular data, like the
meillo@31 108 password file. Beyond that, it soon reaches its limits. The typical
meillo@28 109 case of whitespace-separated fields, in particular, is covered
meillo@28 110 poorly by it. Cut's delimiter is exactly one character;
meillo@29 111 therefore, one cannot split at both space and tab characters.
meillo@27 112 Furthermore, multiple adjacent delimiter characters lead to
meillo@27 113 empty fields. This is not the expected behavior for
meillo@27 114 the processing of whitespace-separated fields. Some
meillo@27 115 implementations, e.g. the one of FreeBSD, have extensions that
meillo@29 116 handle this case in the expected way. On other systems or
meillo@29 117 to stay portable, awk comes to the rescue.
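.LP
The empty-field behavior can be seen in a minimal sketch. The sample line and the variable names are assumptions chosen for the illustration.

```shell
# Two words separated by a run of three spaces.
line='alpha   beta'

# cut treats every single space as a delimiter, so field 2 is the
# empty string between the first and the second space.
cut_second=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# awk splits on runs of blanks and tabs by default.
awk_second=$(printf '%s\n' "$line" | awk '{ print $2 }')

printf 'cut: [%s]\nawk: [%s]\n' "$cut_second" "$awk_second"
```

Here cut yields an empty second field, whereas awk yields \f(CWbeta\fP.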
meillo@27 118 .PP
meillo@28 119 Awk provides another functionality that cut lacks: Changing the order
meillo@27 120 of the fields in the output. For cut, the order of the field
meillo@27 121 selection specification is irrelevant; it doesn't even matter if
meillo@28 122 fields occur multiple times. Thus, the invocation
meillo@27 123 \f(CWcut -c 5-8,1,4-6\fP outputs the characters number
meillo@38 124 1, 4, 5, 6, 7, and 8 in ascending order. The
meillo@28 125 selection specification resembles mathematical set theory: Each
meillo@27 126 specified field is part of the solution set. The fields in the
meillo@27 127 solution set are always in the same order as in the input. To
meillo@27 128 speak with the words of the man page in Version 8 Unix:
meillo@27 129 ``In data base parlance, it projects a relation.''
meillo@27 130 .[[ http://man.cat-v.org/unix_8th/1/cut
meillo@28 131 This means that cut applies the \fIprojection\fP database operation
meillo@27 132 to the text input. Wikipedia explains it in the following way:
meillo@27 133 ``In practical terms, it can be roughly thought of as picking a
meillo@27 134 sub-set of all available columns.''
meillo@27 135 .[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
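.LP
The difference in output ordering described above can be sketched as follows. The input strings and variable names are made up for the example.

```shell
# cut ignores the order of the list: fields come out in input order.
cut_fields=$(echo 'a:b:c' | cut -d: -f3,1)

# The same holds for an unordered, overlapping list in character mode.
cut_chars=$(echo '123456789' | cut -c 5-8,1,4-6)

# awk prints the fields in whatever order the program names them.
awk_fields=$(echo 'a:b:c' | awk -F: '{ print $3 ":" $1 }')

printf '%s\n' "$cut_fields" "$cut_chars" "$awk_fields"
```

The three output lines are \f(CWa:c\fP, \f(CW145678\fP, and \f(CWc:a\fP.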
meillo@27 136
meillo@27 137 .SH
meillo@27 138 Historical Background
meillo@27 139 .LP
meillo@27 140 Cut came to public life in 1982 with the release of UNIX System
meillo@27 141 III. Browsing through the sources of System III, one finds cut.c
meillo@27 142 with the timestamp 1980-04-11
meillo@27 143 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .
meillo@28 144 This is the oldest implementation of the program I was able to
meillo@28 145 discover. However, the SCCS-ID in the source code contains the
meillo@28 146 version number 1.5. According to Doug McIlroy
meillo@27 147 .[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,
meillo@28 148 the earlier history likely lies in PWB/UNIX, which was the
meillo@27 149 basis for System III. In the available sources of PWB 1.0 (1977)
meillo@27 150 .[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,
meillo@27 151 no cut is present. Of PWB 2.0, no sources or useful documentation
meillo@27 152 seem to be available. PWB 3.0 was later renamed to System III
meillo@28 153 for marketing purposes only; it is otherwise identical to it. A
meillo@28 154 branch of PWB was CB UNIX, which was used only internally at
meillo@27 155 Bell Labs. The manual of CB UNIX Edition 2.1 of November 1979
meillo@28 156 contains the earliest mention of cut that my research brought
meillo@28 157 to light, in the form of a man page
meillo@27 158 .[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .
meillo@27 159 .PP
meillo@28 160 A look at BSD: There, my earliest discovery is a cut.c with
meillo@27 161 the file modification date of 1986-11-07
meillo@27 162 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut
meillo@27 163 as part of the special version 4.3BSD-UWisc
meillo@27 164 .[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,
meillo@27 165 which was released in January 1987.
meillo@27 166 This implementation is mostly identical to the one in System
meillo@27 167 III. The better known 4.3BSD-Tahoe (1988) does not contain cut.
meillo@28 168 The subsequent 4.3BSD-Reno (1990) does include cut. It is a freshly
meillo@27 169 written one by Adam S. Moskowitz and Marciano Pitargue, which was
meillo@27 170 included in BSD in 1989
meillo@27 171 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .
meillo@27 172 Its man page
meillo@27 173 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1
meillo@27 174 already mentions the expected compliance to POSIX.2.
meillo@27 175 One should note that POSIX.2 was first published in
meillo@27 176 September 1992, about two years after the man page and the
meillo@27 177 program were written. Hence, the program must have been
meillo@27 178 implemented based on a draft version of the standard. A look into
meillo@27 179 the code confirms the assumption. The function to parse the field
meillo@27 180 selection includes the following comment:
meillo@27 181 .QP
meillo@27 182 This parser is less restrictive than the Draft 9 POSIX spec.
meillo@27 183 POSIX doesn't allow lists that aren't in increasing order or
meillo@27 184 overlapping lists.
meillo@27 185 .LP
meillo@27 186 Draft 11.2 of POSIX (1991-09) requires this flexibility already:
meillo@27 187 .QP
meillo@27 188 The elements in list can be repeated, can overlap, and can
meillo@27 189 be specified in any order.
meillo@27 190 .LP
meillo@27 191 The same draft additionally includes all three operation modes,
meillo@27 192 whereas this early BSD cut only implemented the original two.
meillo@27 193 Draft 9 might not have included the byte mode. Without access to
meillo@27 194 Draft 9 or 10, it wasn't possible to verify this guess.
meillo@27 195 .PP
meillo@27 196 The version numbers and change dates of the older BSD
meillo@27 197 implementations are manifested in the SCCS-IDs, which the
meillo@27 198 version control system of that time inserted. For instance
meillo@27 199 in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.
meillo@27 200 .PP
meillo@27 201 The cut implementation of the GNU coreutils contains the
meillo@27 202 following copyright notice:
meillo@27 203 .CS
meillo@27 204 Copyright (C) 1997-2015 Free Software Foundation, Inc.
meillo@27 205 Copyright (C) 1984 David M. Ihnat
meillo@27 206 .CE
meillo@27 207 .LP
meillo@31 208 This code does have old origins. Further comments show that
meillo@27 209 the source code was reworked by David MacKenzie first and later
meillo@27 210 by Jim Meyering, who put it into the version control system in
meillo@28 211 1992. It is unclear why the years until 1997, at least from
meillo@28 212 1992 onwards, don't show up in the copyright notice.
meillo@27 213 .PP
meillo@27 214 Despite all those year numbers from the 80s, cut is a rather
meillo@27 215 young tool, at least in relation to the early Unix. Although
meillo@28 216 cut is a decade older than Linux (the kernel), Unix had existed
meillo@31 217 for over ten years already by the time cut appeared for the first
meillo@27 218 time. Most notably, cut wasn't part of Version 7 Unix, which
meillo@27 219 became the basis for all modern Unix systems. The more complex
meillo@28 220 tools sed and awk were part of it already. Hence, the
meillo@28 221 question comes to mind why cut was written at all, as two
meillo@31 222 programs already existed that were able to cover its use
meillo@31 223 cases. One reason for cut surely was its compactness and the
meillo@28 224 resulting speed, in comparison to the then-bulky awk. This lean
meillo@33 225 shape goes well with the Unix philosophy: Do one job and do it
meillo@28 226 well! Cut was sufficiently convincing. It found its way to
meillo@28 227 other Unix variants, it became standardized, and today it is
meillo@28 228 present everywhere.
meillo@27 229 .PP
meillo@28 230 The original variant (without \f(CW-b\fP) was described already
meillo@28 231 in 1985, by the System V Interface Definition, an important
meillo@28 232 formal description of UNIX System V. In the following years, it
meillo@28 233 appeared in all relevant standards. POSIX.2 specified cut for
meillo@28 234 the first time in its modern form (with \f(CW-b\fP) in 1992.
meillo@27 235
meillo@34 236 .pl -1v
meillo@27 237 .SH
meillo@27 238 Multi-byte support
meillo@27 239 .LP
meillo@28 240 The byte mode and thus the multi-byte support of the POSIX
meillo@29 241 character mode have been standardized since 1992. But are
meillo@29 242 they present in the available implementations? Which versions
meillo@29 243 implement POSIX correctly?
meillo@27 244 .PP
meillo@28 245 The situation is divided into three parts: There are historic
meillo@27 246 implementations, which have only \f(CW-c\fP and \f(CW-f\fP.
meillo@28 247 Then there are implementations that have \f(CW-b\fP, but
meillo@27 248 treat it as an alias for \f(CW-c\fP only. These
meillo@27 249 implementations work correctly for single-byte encodings
meillo@34 250 (e.g. US-ASCII, Latin1) but for multi-byte en\%codings (e.g.
meillo@27 251 UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and
meillo@27 252 \f(CW-n\fP is ignored). Finally, there are implementations
meillo@28 253 that implement \f(CW-c\fP and \f(CW-b\fP in a POSIX-compliant
meillo@28 254 way.
meillo@27 255 .PP
meillo@29 256 Historic two-mode implementations are the ones of
meillo@31 257 System III, System V, and the BSD ones until the mid-90s.
meillo@27 258 .PP
meillo@28 259 Pseudo multi-byte implementations are provided by GNU,
meillo@28 260 modern NetBSD, and modern OpenBSD. The level of POSIX compliance
meillo@27 261 that is claimed there is often higher than the level of
meillo@27 262 compliance that is actually provided. Sometimes it takes a
meillo@27 263 close look to discover that \f(CW-c\fP and \f(CW-n\fP don't
meillo@27 264 behave as expected. Some of the implementations take the
meillo@27 265 easy way by simply being ignorant of any multi-byte
meillo@28 266 encodings; at least they declare that clearly:
meillo@27 267 .QP
meillo@28 268 Since we don't support multi-byte characters, the \f(CW-c\fP
meillo@28 269 and \f(CW-b\fP options are equivalent, and the \f(CW-n\fP
meillo@28 270 option is meaningless.
meillo@27 271 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup
meillo@27 272 .LP
meillo@31 273 Standard-adhering implementations, i.e. ones that treat
meillo@31 274 multi-byte characters correctly, are those of the modern
meillo@31 275 FreeBSD and the Heirloom toolchest. Tim Robbins
meillo@27 276 reimplemented the character mode of FreeBSD cut,
meillo@28 277 conforming to POSIX, in the summer of 2004
meillo@27 278 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .
meillo@28 279 The question why the other BSD systems have not
meillo@33 280 integrated this change is an open one. Maybe the answer is
meillo@33 281 a general ignorance of internationalization.
meillo@27 282 .PP
meillo@31 283 How do users find out if the cut on their own system handles
meillo@28 284 multi-byte characters correctly? First, one needs to check if
meillo@27 285 the system itself uses multi-byte characters, because otherwise
meillo@27 286 characters and bytes are equivalent and the question
meillo@27 287 is irrelevant. One can check this by looking at the locale
meillo@27 288 settings, but it is easier to print a typical multi-byte
meillo@27 289 character, for instance an umlaut or the euro currency
meillo@28 290 symbol, and check if one or more bytes are generated as
meillo@28 291 output:
meillo@27 292 .CS
meillo@27 293 $ echo ä | od -c
meillo@27 294 0000000 303 244 \\n
meillo@27 295 0000003
meillo@27 296 .CE
meillo@27 297 .LP
meillo@28 298 In this case it resulted in two bytes: octal 303 and 244. (The
meillo@28 299 newline character is added by echo.)
meillo@27 300 .PP
meillo@27 301 The program iconv converts text to specific encodings. This
meillo@27 302 is the output for Latin1 and UTF-8, for comparison:
meillo@27 303 .CS
meillo@27 304 $ echo ä | iconv -t latin1 | od -c
meillo@27 305 0000000 344 \\n
meillo@27 306 0000002
meillo@27 307 .sp .3
meillo@27 308 $ echo ä | iconv -t utf8 | od -c
meillo@27 309 0000000 303 244 \\n
meillo@27 310 0000003
meillo@27 311 .CE
meillo@27 312 .LP
meillo@27 313 The output (without the iconv conversion) on many European
meillo@27 314 systems equals one of these two.
meillo@27 315 .PP
meillo@28 316 Now for the test of the cut implementation. On a UTF-8 system, a
meillo@28 317 POSIX-compliant implementation behaves as follows:
meillo@27 318 .CS
meillo@27 319 $ echo ä | cut -c 1 | od -c
meillo@27 320 0000000 303 244 \\n
meillo@27 321 0000003
meillo@27 322 .sp .3
meillo@27 323 $ echo ä | cut -b 1 | od -c
meillo@27 324 0000000 303 \\n
meillo@27 325 0000002
meillo@27 326 .sp .3
meillo@27 327 $ echo ä | cut -b 1 -n | od -c
meillo@27 328 0000000 \\n
meillo@27 329 0000001
meillo@27 330 .CE
meillo@27 331 .LP
meillo@28 332 A pseudo-POSIX implementation, in contrast, behaves like the
meillo@28 333 middle one for all three invocations: Only the first byte is
meillo@28 334 printed as output.
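.LP
The first of these checks can be wrapped into a small helper. This is only a sketch: the function name and the two labels are made up here, and the answer is only meaningful in a UTF-8 locale, because in a single-byte locale every cut keeps just the first byte. It uses \f(CWod -t o1\fP instead of \f(CWod -c\fP so that every byte, including the newline (octal 012), shows up as a plain octal number.

```shell
# Hypothetical helper: report how the local cut -c treats a two-byte
# UTF-8 character. The "ä" is written as its bytes (octal 303 244),
# so the script does not depend on the encoding of this file.
check_cut_c() {
    # od -t o1 prints every byte in octal; tr removes blanks,
    # tabs, and newlines so that the digits can be compared.
    bytes=$(printf '\303\244\n' | cut -c 1 | od -An -t o1 | tr -d ' \t\n')
    case $bytes in
    303244012) echo 'character-aware' ;;  # the whole character was kept
    303012)    echo 'byte-oriented' ;;    # only the first byte was kept
    *)         echo "unexpected: $bytes" ;;
    esac
}

check_cut_c
```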
meillo@27 335
meillo@27 336 .SH
meillo@27 337 Implementations
meillo@27 338 .LP
meillo@27 339 Let's take a look at the sources of a selection of
meillo@27 340 implementations.
meillo@27 341 .PP
meillo@27 342 A comparison of the amount of source code is good to get a first
meillo@28 343 impression. Typically, it grows over time. This can generally
meillo@28 344 be seen here, but not in all cases. A POSIX-compliant
meillo@27 345 implementation of the character mode requires more code, thus
meillo@28 346 these implementations tend to be the larger ones.
meillo@27 347 .TS
meillo@27 348 center;
meillo@27 349 r r r l l l.
meillo@31 350 SLOC Lines Bytes Belongs to File time Category
meillo@27 351 _
meillo@27 352 116 123 2966 System III 1980-04-11 historic
meillo@27 353 118 125 3038 4.3BSD-UWisc 1986-11-07 historic
meillo@27 354 200 256 5715 4.3BSD-Reno 1990-06-25 historic
meillo@27 355 200 270 6545 NetBSD 1993-03-21 historic
meillo@27 356 218 290 6892 OpenBSD 2008-06-27 pseudo-POSIX
meillo@27 357 224 296 6920 FreeBSD 1994-05-27 historic
meillo@27 358 232 306 7500 NetBSD 2014-02-03 pseudo-POSIX
meillo@27 359 340 405 7423 Heirloom 2012-05-20 POSIX
meillo@27 360 382 586 14175 GNU coreutils 1992-11-08 pseudo-POSIX
meillo@27 361 391 479 10961 FreeBSD 2012-11-24 POSIX
meillo@27 362 588 830 23167 GNU coreutils 2015-05-01 pseudo-POSIX
meillo@27 363 .TE
meillo@27 364 .LP
meillo@31 365 There are four rough groups: (1) The two original
meillo@28 366 implementations, which are mostly identical, with about 100
meillo@27 367 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The
meillo@27 368 two POSIX-compliant versions and the old GNU one, with a SLOC
meillo@28 369 count in the 300s. And finally, (4) the modern GNU cut with
meillo@27 370 almost 600 SLOC.
meillo@27 371 .PP
meillo@27 372 The variation between the number of logical code
meillo@28 373 lines (SLOC, measured with SLOCcount) and the number of
meillo@28 374 newlines in the file (\f(CWwc -l\fP) spans between factor
meillo@27 375 1.06 for the oldest versions and factor 1.5 for GNU. The
meillo@28 376 largest influences on it are empty lines, pure comment lines,
meillo@27 377 and the size of the license block at the beginning of the file.
meillo@27 378 .PP
meillo@27 379 Regarding the variation between logical code lines and the
meillo@27 380 file size (\f(CWwc -c\fP), the implementations span between
meillo@27 381 25 and 30 bytes per statement. With only 21 bytes per
meillo@27 382 statement, the Heirloom implementation marks the lower end;
meillo@28 383 the GNU implementation sets the upper limit at nearly 40 bytes. In
meillo@27 384 the case of GNU, the reason is mainly their coding style, with
meillo@28 385 special indentation rules and long identifiers. Whether one finds
meillo@27 386 the Heirloom implementation
meillo@27 387 .[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup
meillo@28 388 highly cryptic or exceptionally elegant shall be left
meillo@28 389 to the judgement of the reader. Especially the
meillo@27 390 comparison to the GNU implementation
meillo@27 391 .[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643
meillo@27 392 is impressive.
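.LP
The ratios quoted in the two preceding paragraphs can be recomputed from the table. The following sketch does so for four of its rows; the one-word labels are shortened forms introduced here so that awk sees exactly four columns.

```shell
# Rows copied from the table: SLOC, lines, bytes, and a label
# (the labels are abbreviated for this example).
ratios=$(printf '%s\n' \
    '116 123 2966 SystemIII' \
    '340 405 7423 Heirloom' \
    '382 586 14175 GNU-1992' \
    '588 830 23167 GNU-2015' |
    awk '{ printf "%s lines/SLOC=%.2f bytes/SLOC=%.1f\n", $4, $2 / $1, $3 / $1 }')

printf '%s\n' "$ratios"
```

For System III this gives 1.06 lines and 25.6 bytes per SLOC, for Heirloom 21.8 bytes per SLOC, and for the 2015 GNU cut 39.4 bytes per SLOC.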
meillo@27 393 .PP
meillo@27 394 The internal structure of the source code (in all cases it is
meillo@27 395 written in C) is largely similar. Besides the mandatory main
meillo@27 396 function, which does the command line argument processing,
meillo@28 397 there usually is a function to convert the field
meillo@27 398 selection specification to an internal data structure.
meillo@28 399 Furthermore, almost all implementations have separate
meillo@27 400 functions for each of their operation modes. The POSIX-compliant
meillo@27 401 versions treat the \f(CW-b -n\fP combination as a separate
meillo@28 402 mode and thus implement it in a separate function. Only the early
meillo@27 403 System III implementation (and its 4.3BSD-UWisc variant) does
meillo@27 404 everything, apart from error handling, in the main function.
meillo@27 405 .PP
meillo@27 406 Implementations of cut typically have two limiting aspects:
meillo@27 407 One being the maximum number of fields that can be handled,
meillo@27 408 the other being the maximum line length. On System III, both
meillo@27 409 numbers are limited to 512. 4.3BSD-Reno and the BSDs of the
meillo@27 410 90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or
meillo@28 411 \f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, modern NetBSD, all GNU
meillo@28 412 implementations, and the Heirloom cut are able to handle
meillo@27 413 arbitrary numbers of fields and line lengths \(en the memory
meillo@27 414 is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed
meillo@27 415 maximum number of fields, but allows arbitrary line lengths.
meillo@28 416 The limited number of fields does not, however, appear to be
meillo@27 417 any practical problem, because \f(CW_POSIX2_LINE_MAX\fP is
meillo@27 418 guaranteed to be at least 2048 and is thus probably large enough.
meillo@27 419
meillo@27 420 .SH
meillo@27 421 Descriptions
meillo@27 422 .LP
meillo@27 423 Interesting, as well, is a comparison of the short descriptions
meillo@27 424 of cut, as can be found in the headlines of the man
meillo@27 425 pages or at the beginning of the source code files.
meillo@28 426 The following list is roughly grouped by origin:
meillo@27 427 .TS
meillo@27 428 center;
meillo@27 429 l l.
meillo@27 430 CB UNIX cut out selected fields of each line of a file
meillo@27 431 System III cut out selected fields of each line of a file
meillo@27 432 System III \(dg cut and paste columns of a table (projection of a relation)
meillo@27 433 System V cut out selected fields of each line of a file
meillo@27 434 HP-UX cut out (extract) selected fields of each line of a file
meillo@27 435 .sp .3
meillo@27 436 4.3BSD-UWisc \(dg cut and paste columns of a table (projection of a relation)
meillo@27 437 4.3BSD-Reno select portions of each line of a file
meillo@27 438 NetBSD select portions of each line of a file
meillo@27 439 OpenBSD 4.6 select portions of each line of a file
meillo@27 440 FreeBSD 1.0 select portions of each line of a file
meillo@27 441 FreeBSD 10.0 cut out selected portions of each line of a file
meillo@27 442 SunOS 4.1.3 remove selected fields from each line of a file
meillo@27 443 SunOS 5.5.1 cut out selected fields of each line of a file
meillo@27 444 .sp .3
meillo@27 445 Heirloom Tools cut out selected fields of each line of a file
meillo@27 446 Heirloom Tools \(dg cut out fields of lines of files
meillo@27 447 .sp .3
meillo@27 448 GNU coreutils remove sections from each line of files
meillo@27 449 .sp .3
meillo@27 450 Minix select out columns of a file
meillo@27 451 .sp .3
meillo@27 452 Version 8 Unix rearrange columns of data
meillo@27 453 ``Unix Reader'' rearrange columns of text
meillo@27 454 .sp .3
meillo@27 455 POSIX cut out selected fields of each line of a file
meillo@27 456 .TE
meillo@27 457 .LP
meillo@27 458 (The descriptions that are marked with `\(dg' were taken from
meillo@27 459 source code files. The POSIX entry contains the description
meillo@27 460 used in the standard. The ``Unix Reader'' is a retrospective
meillo@27 461 document by Doug McIlroy, which lists the availability of
meillo@27 462 tools in the Research Unix versions
meillo@27 463 .[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .
meillo@27 464 Its description should actually match the one in Version 8
meillo@27 465 Unix. The change could be a transfer mistake or a correction.
meillo@27 466 All other descriptions originate from the various man pages.)
meillo@27 467 .PP
meillo@27 468 Over time, the POSIX description was often adopted or it
meillo@27 469 served as inspiration. One such example is FreeBSD
meillo@27 470 .[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .
meillo@27 471 .PP
meillo@27 472 It is noteworthy that the GNU coreutils in all versions
meillo@27 473 describe the performed action as a removal of parts of the
meillo@28 474 input, although the user clearly selects the parts that then
meillo@37 475 constitute the output. Probably the words ``cut out'' are too
meillo@28 476 misleading. HP-UX tried to be clearer.
meillo@27 477 .PP
meillo@28 478 Different terms are also used for the part being
meillo@27 479 selected. Some talk about fields (POSIX), some talk
meillo@27 480 about portions (BSD) and some call it columns (Research
meillo@27 481 Unix).
meillo@27 482 .PP
meillo@27 483 The seemingly least adequate description, the one of Version
meillo@27 484 8 Unix (``rearrange columns of data''), can be explained in so
meillo@39 485 far as the man page covers both cut and paste, and in
meillo@39 486 their combination, columns can be rearranged. The use of the
meillo@39 487 word ``data'' instead of ``text'' might be a lapse, which
meillo@39 488 McIlroy corrected in his Unix Reader ... but on the other hand,
meillo@39 489 on Unix, the two words are mostly synonymous, because all data
meillo@27 490 is text.
meillo@27 491
meillo@27 492
meillo@27 493 .SH
meillo@28 494 References
meillo@27 495 .LP
meillo@27 496 .nf
meillo@27 497 ._r
meillo@27 498