docs/cut: 5f78bcd34eeb cut.en.ms

added missing letter

author	markus schnalke <meillo@marmaro.de>
date	Fri, 18 Sep 2015 10:28:29 +0200

1 .so macros

2 .lc_ctype en_US.utf8

3 .pl -4v

5 .TL

6 Cut out selected fields of each line of a file

7 .AU

8 markus schnalke <meillo@marmaro.de>

9 ..

10 .FS

11 2015-05.

12 This text is in the public domain (CC0).

13 It is available online:

14 .I http://marmaro.de/docs/

15 .FE

17 .LP

18 Cut is a classic program in the Unix toolchest.

19 It is present in most tutorials on shell programming, because it

20 is such a nice and useful tool with good explanatory value.

21 This text shall take a look underneath its surface.

22 .SH

23 Usage

24 .LP

25 Initially, cut had two operation modes, which were later supplemented

26 by a third: The cut program may cut specified characters or bytes

27 out of the input lines or it may cut out specified fields, which

28 are defined by a delimiting character.

29 .PP

30 The character mode is well suited to slice fixed-width input

31 formats into parts. One might, for instance, extract the access

32 rights from the output of \f(CWls -l\fP, as shown here with the

33 rights of a file's owner:

34 .CS

35 $ ls -l foo

36 -rw-rw-r-- 1 meillo users 0 May 12 07:32 foo

37 .sp .3

38 $ ls -l foo | cut -c 2-4

39 rw-

40 .CE

41 .LP

42 Or the write permission for the owner, the group, and the

43 world:

44 .CS

45 $ ls -l foo | cut -c 3,6,9

46 ww-

47 .CE

48 .LP

49 Cut can also be used to shorten strings:

50 .CS

51 $ long=12345678901234567890

52 .sp .3

53 $ echo "$long" | cut -c -10

54 1234567890

55 .CE

56 .LP

57 This command outputs no more than the first 10 characters of

58 \f(CW$long\fP. (Alternatively, one could use \f(CWprintf

59 "%.10s\\n" "$long"\fP for this task.)

60 .PP

61 However, if it's not about displaying characters, but rather about

62 storing them, then \f(CW-c\fP is only partly suited. In former times,

63 when US-ASCII was the omnipresent character encoding, each

64 character was stored as exactly one byte. Therefore, \f(CWcut

65 -c\fP selected both output characters and bytes equally. With

66 the rise of multi-byte encodings (like UTF-8), this assumption

67 became obsolete. Consequently, a byte mode (option \f(CW-b\fP)

68 was added to cut, with POSIX.2-1992. To select up to 500 bytes

69 from the beginning of each line (and ignore the rest), one can use:

70 .CS

71 $ cut -b -500

72 .CE

73 .LP

74 The remainder can be caught with \f(CWcut -b 501-\fP. This

75 use of cut is important for POSIX, because it provides a

76 transformation of text files with arbitrary line lengths to text

77 files with limited line length

78 .[[ http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html#tag_20_28_17 .

79 .PP

80 The newly introduced byte mode essentially provided the same

81 functionality as the old character mode. The character mode,

82 however, required a new, different implementation. In consequence,

83 the problem was not the support of the byte mode, but rather the

84 correct support of the new character mode.

85 .PP

86 Besides the character and byte modes, cut also offers a field

87 mode, which is activated by \f(CW-f\fP. It selects fields from

88 the input. The field-delimiter character for the input as well

89 as for the output (by default the tab) may be changed using

90 \f(CW-d\fP.

91 .PP

92 The typical example for the use of cut's field mode is the

93 selection of information from the password file. Here, for

94 instance, the usernames and their uids:

95 .CS

96 $ cut -d: -f1,3 /etc/passwd

97 root:0

98 bin:1

99 daemon:2

100 mail:8

101 ...

102 .CE

103 .LP

104 (The values to the command line switches may be appended directly

105 to them or separated by whitespace.)

106 .PP

107 The field mode is suited for simple tabular data, like the

108 password file. Beyond that, it soon reaches its limits. The typical

109 case of whitespace-separated fields, in particular, is covered

110 poorly by it. Cut's delimiter is exactly one character,

111 therefore one cannot split at both space and tab characters.

112 Furthermore, multiple adjacent delimiter characters lead to

113 empty fields. This is not the expected behavior for

114 the processing of whitespace-separated fields. Some

115 implementations, e.g. the one of FreeBSD, have extensions that

116 handle this case in the expected way. On other systems or

117 to stay portable, awk comes to the rescue.
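.PP
The difference can be sketched with a line whose two fields are
separated by a run of three spaces. The input is made up for
illustration; only standard cut and awk behavior is assumed:

```shell
# Two fields separated by a run of three spaces (illustrative input).
line='alpha   beta'

# cut treats every single space as a delimiter, so the run yields
# empty fields: field 2 is empty, and "beta" only shows up as field 4.
echo "$line" | cut -d' ' -f2
echo "$line" | cut -d' ' -f4

# awk splits on runs of blanks and tabs by default: "beta" is field 2.
echo "$line" | awk '{ print $2 }'
```

The first command prints an empty line, whereas the other two
print ``beta''.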

118 .PP

119 Awk provides another functionality that cut lacks: Changing the order

120 of the fields in the output. For cut, the order of the field

121 selection specification is irrelevant; it doesn't even matter if

122 fields occur multiple times. Thus, the invocation

123 \f(CWcut -c 5-8,1,4-6\fP outputs the characters number

124 1, 4, 5, 6, 7, and 8 in exactly this order. The

125 selection specification resembles mathematical set theory: Each

126 specified field is part of the solution set. The fields in the

127 solution set are always in the same order as in the input. In

128 the words of the man page in Version 8 Unix:

129 ``In data base parlance, it projects a relation.''

130 .[[ http://man.cat-v.org/unix_8th/1/cut

131 This means that cut applies the \fIprojection\fP database operation

132 to the text input. Wikipedia explains it in the following way:

133 ``In practical terms, it can be roughly thought of as picking a

134 sub-set of all available columns.''

135 .[[ https://en.wikipedia.org/wiki/Projection_(relational_algebra)
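.PP
A small sketch of this set-like behavior, using made-up input:
the order within the list does not influence cut's output, whereas
a reordering of fields needs a tool like awk.

```shell
# The order of the list is irrelevant: both invocations print the
# characters 1, 4, 5, 6, 7, and 8 in input order.
echo abcdefghij | cut -c 5-8,1,4-6
echo abcdefghij | cut -c 1,4-8

# Swapping two fields is not possible with cut alone; awk can do it.
echo 'root:0' | awk -F: '{ print $2 ":" $1 }'
```

Both cut invocations print ``adefgh''; the awk invocation prints
``0:root''.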

136

137 .SH

138 Historical Background

139 .LP

140 Cut came to public life in 1982 with the release of UNIX System

141 III. Browsing through the sources of System III, one finds cut.c

142 with the timestamp 1980-04-11

143 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=SysIII/usr/src/cmd .

144 This is the oldest implementation of the program I was able to

145 discover. However, the SCCS-ID in the source code contains the

146 version number 1.5. According to Doug McIlroy

147 .[[ http://minnie.tuhs.org/pipermail/tuhs/2015-May/004083.html ,

148 the earlier history likely lies in PWB/UNIX, which was the

149 basis for System III. In the available sources of PWB 1.0 (1977)

150 .[[ http://minnie.tuhs.org/Archive/PDP-11/Distributions/usdl/ ,

151 no cut is present. Of PWB 2.0, no sources or useful documentation

152 seem to be available. PWB 3.0 was later renamed to System III

153 for marketing purposes only; the two are otherwise identical. A

154 branch of PWB was CB UNIX, which was used only internally at

155 Bell Labs. The manual of CB UNIX Edition 2.1 of November 1979

156 contains the earliest mention of cut that my research brought

157 to light, in the form of a man page

158 .[[ ftp://sunsite.icm.edu.pl/pub/unix/UnixArchive/PDP-11/Distributions/other/CB_Unix/cbunix_man1_02.pdf .

159 .PP

160 A look at BSD: There, my earliest discovery is a cut.c with

161 the file modification date of 1986-11-07

162 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-UWisc/src/usr.bin/cut

163 as part of the special version 4.3BSD-UWisc

164 .[[ http://gunkies.org/wiki/4.3_BSD_NFS_Wisconsin_Unix ,

165 which was released in January 1987.

166 This implementation is mostly identical to the one in System

167 III. The better known 4.3BSD-Tahoe (1988) does not contain cut.

168 The subsequent 4.3BSD-Reno (1990) does include cut. It is a freshly

169 written one by Adam S. Moskowitz and Marciano Pitargue, which was

170 included in BSD in 1989

171 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut .

172 Its man page

173 .[[ http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.3BSD-Reno/src/usr.bin/cut/cut.1

174 already mentions the expected compliance to POSIX.2.

175 One should note that POSIX.2 was first published in

176 September 1992, about two years after the man page and the

177 program were written. Hence, the program must have been

178 implemented based on a draft version of the standard. A look into

179 the code confirms the assumption. The function to parse the field

180 selection includes the following comment:

181 .QP

182 This parser is less restrictive than the Draft 9 POSIX spec.

183 POSIX doesn't allow lists that aren't in increasing order or

184 overlapping lists.

185 .LP

186 Draft 11.2 of POSIX (1991-09) requires this flexibility already:

187 .QP

188 The elements in list can be repeated, can overlap, and can

189 be specified in any order.

190 .LP

191 The same draft additionally includes all three operation modes,

192 whereas this early BSD cut only implemented the original two.

193 Draft 9 might not have included the byte mode. Without access to

194 Draft 9 or 10, it wasn't possible to verify this guess.

195 .PP

196 The version numbers and change dates of the older BSD

197 implementations are manifested in the SCCS-IDs, which the

198 version control system of that time inserted. For instance

199 in 4.3BSD-Reno: ``5.3 (Berkeley) 6/24/90''.

200 .PP

201 The cut implementation of the GNU coreutils contains the

202 following copyright notice:

203 .CS

206 .CE

207 .LP

208 This code does have old origins. Further comments show that

209 the source code was reworked by David MacKenzie first and later

210 by Jim Meyering, who put it into the version control system in

211 1992. It is unclear why the years until 1997, at least from

212 1992 onwards, don't show up in the copyright notice.

213 .PP

214 Despite all those year numbers from the 80s, cut is a rather

215 young tool, at least in relation to early Unix. Although cut

216 is a decade older than Linux (the kernel), Unix had already

217 existed for over ten years by the time cut appeared for the first

218 time. Most notably, cut wasn't part of Version 7 Unix, which

219 became the basis for all modern Unix systems. The more complex

220 tools sed and awk were part of it already. Hence, the

221 question comes to mind why cut was written at all, as two

222 programs already existed that were able to cover its use

223 cases. One reason for cut surely was its compactness and the

224 resulting speed, in comparison to the then-bulky awk. This lean

225 shape goes well with the Unix philosophy: Do one job and do it

226 well! Cut was sufficiently convincing. It found its way to

227 other Unix variants, it became standardized, and today it is

228 present everywhere.

229 .PP

230 The original variant (without \f(CW-b\fP) was described already

231 in 1985, by the System V Interface Definition, an important

232 formal description of UNIX System V. In the following years, it

233 appeared in all relevant standards. POSIX.2 specified cut for

234 the first time in its modern form (with \f(CW-b\fP) in 1992.

235

236 .SH

237 Multi-byte support

238 .LP

239 The byte mode and thus the multi-byte support of the POSIX

240 character mode have been standardized since 1992. But are

241 they present in the available implementations? Which versions

242 implement POSIX correctly?

243 .PP

244 The situation is divided into three parts: There are historic

245 implementations, which have only \f(CW-c\fP and \f(CW-f\fP.

246 Then there are implementations that have \f(CW-b\fP, but

247 treat it as an alias for \f(CW-c\fP only. These

248 implementations work correctly for single-byte encodings

249 (e.g. US-ASCII, Latin1) but for multi-byte encodings (e.g.

250 UTF-8) their \f(CW-c\fP behaves like \f(CW-b\fP (and

251 \f(CW-n\fP is ignored). Finally, there are implementations

252 that implement \f(CW-c\fP and \f(CW-b\fP in a POSIX-compliant

253 way.

254 .PP

255 Historic two-mode implementations are the ones of

256 System III, System V, and the BSD ones until the mid-90s.

257 .PP

258 Pseudo multi-byte implementations are provided by GNU,

259 modern NetBSD, and modern OpenBSD. The level of POSIX compliance

260 that is presented there is often higher than the level of

261 compliance that is actually provided. Sometimes it takes a

262 close look to discover that \f(CW-c\fP and \f(CW-n\fP don't

263 behave as expected. Some of the implementations take the

264 easy way out by simply ignoring multi-byte encodings

265 altogether, but at least they declare that clearly:

266 .QP

267 Since we don't support multi-byte characters, the \f(CW-c\fP

268 and \f(CW-b\fP options are equivalent, and the \f(CW-n\fP

269 option is meaningless.

270 .[[ http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/cut/cut.c?rev=1.18&content-type=text/x-cvsweb-markup

271 .LP

272 Standard-adhering implementations, i.e. ones that treat

273 multi-byte characters correctly, are those of the modern

274 FreeBSD and the Heirloom toolchest. Tim Robbins

275 reimplemented the character mode of FreeBSD cut,

276 conforming to POSIX, in the summer of 2004

277 .[[ https://svnweb.freebsd.org/base?view=revision&revision=131194 .

278 The question why the other BSD systems have not

279 integrated this change is an open one. Maybe the answer can be

280 found in the above quoted statement.

281 .PP

282 How do users find out if the cut on their own system handles

283 multi-byte characters correctly? First, one needs to check if

284 the system itself uses multi-byte characters, because otherwise

285 characters and bytes are equivalent and the question

286 is irrelevant. One can check this by looking at the locale

287 settings, but it is easier to print a typical multi-byte

288 character, for instance an umlaut or the euro currency

289 symbol, and check if one or more bytes are generated as

290 output:

291 .CS

292 $ echo ä | od -c

293 0000000 303 244 \\n

294 0000003

295 .CE

296 .LP

297 In this case it resulted in two bytes: octal 303 and 244. (The

298 newline character is added by echo.)

299 .PP

300 The program iconv converts text to specific encodings. This

301 is the output for Latin1 and UTF-8, for comparison:

302 .CS

303 $ echo ä | iconv -t latin1 | od -c

304 0000000 344 \\n

305 0000002

306 .sp .3

307 $ echo ä | iconv -t utf8 | od -c

308 0000000 303 244 \\n

309 0000003

310 .CE

311 .LP

312 The output (without the iconv conversion) on many European

313 systems equals one of these two.

314 .PP

315 Now for the test of the cut implementation. On a UTF-8 system, a

316 POSIX-compliant implementation behaves as follows:

317 .CS

318 $ echo ä | cut -c 1 | od -c

319 0000000 303 244 \\n

320 0000003

321 .sp .3

322 $ echo ä | cut -b 1 | od -c

323 0000000 303 \\n

324 0000002

325 .sp .3

326 $ echo ä | cut -b 1 -n | od -c

327 0000000 \\n

328 0000001

329 .CE

330 .LP

331 A pseudo-POSIX implementation, in contrast, behaves like the

332 middle one for all three invocations: Only the first byte is

333 printed as output.
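.PP
The manual comparison above could be wrapped into a small script.
This is only a sketch: the variable names are made up, and it
assumes that the script file itself is stored in the encoding of
the current locale, so that the literal character is one
character there.

```shell
# Count the bytes that each mode lets through for the first
# character of the input (the count includes the trailing newline).
chars=$(echo ä | cut -c 1 | wc -c | tr -d ' ')
bytes=$(echo ä | cut -b 1 | wc -c | tr -d ' ')

# In a UTF-8 locale, a multi-byte-aware cut yields 3 versus 2 here.
# Equal counts mean that -c selected bytes, or that the locale
# uses a single-byte encoding.
if [ "$chars" -gt "$bytes" ]; then
	echo "cut -c handles multi-byte characters"
else
	echo "cut -c and cut -b select the same bytes here"
fi
```

The result is only meaningful on a system that uses a multi-byte
encoding, as discussed above.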

334

335 .SH

336 Implementations

337 .LP

338 Let's take a look at the sources of a selection of

339 implementations.

340 .PP

341 A comparison of the amount of source code is good to get a first

342 impression. Typically, it grows through time. This can generally

343 be seen here, but not in all cases. A POSIX-compliant

344 implementation of the character mode requires more code, thus

345 these implementations tend to be the larger ones.

346 .TS

347 center;

348 r r r l l l.

349 SLOC	Lines	Bytes	Belongs to	File time	Category

350 _

351 116	123	2966	System III	1980-04-11	historic

352 118	125	3038	4.3BSD-UWisc	1986-11-07	historic

353 200	256	5715	4.3BSD-Reno	1990-06-25	historic

354 200	270	6545	NetBSD	1993-03-21	historic

355 218	290	6892	OpenBSD	2008-06-27	pseudo-POSIX

356 224	296	6920	FreeBSD	1994-05-27	historic

357 232	306	7500	NetBSD	2014-02-03	pseudo-POSIX

358 340	405	7423	Heirloom	2012-05-20	POSIX

359 382	586	14175	GNU coreutils	1992-11-08	pseudo-POSIX

360 391	479	10961	FreeBSD	2012-11-24	POSIX

361 588	830	23167	GNU coreutils	2015-05-01	pseudo-POSIX

362 .TE

363 .LP

364 There are four rough groups: (1) The two original

365 implementations, which are mostly identical, with about 100

366 SLOC. (2) The five BSD versions, with about 200 SLOC. (3) The

367 two POSIX-compliant versions and the old GNU one, with a SLOC

368 count in the 300s. And finally, (4) the modern GNU cut with

369 almost 600 SLOC.

370 .PP

371 The ratio of the number of newlines in the file

372 (\f(CWwc -l\fP) to the number of logical code lines (SLOC,

373 measured with SLOCcount) ranges from a factor of 1.06 for the

374 oldest versions to a factor of 1.5 for GNU. The

375 largest influences on it are empty lines, pure comment lines,

376 and the size of the license block at the beginning of the file.

377 .PP

378 Regarding the ratio of the file size (\f(CWwc -c\fP) to the

379 number of logical code lines, the implementations range between

380 25 and 30 bytes per statement. With only 21 bytes per

381 statement, the Heirloom implementation marks the lower end;

382 the GNU implementation sets the upper limit at nearly 40 bytes. In

383 the case of GNU, the reason is mainly their coding style, with

384 special indentation rules and long identifiers. Whether one finds

385 the Heirloom implementation

386 .[[ http://heirloom.cvs.sourceforge.net/viewvc/heirloom/heirloom/cut/cut.c?revision=1.6&view=markup

387 highly cryptic or exceptionally elegant shall be left

388 to the judgement of the reader. Especially the

389 comparison to the GNU implementation

390 .[[ http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=e981643

391 is impressive.

392 .PP

393 The internal structure of the source code (in all cases it is

394 written in C) is mainly similar. Besides the mandatory main

395 function, which does the command line argument processing,

396 there usually is a function to convert the field

397 selection specification to an internal data structure.

398 Furthermore, almost all implementations have separate

399 functions for each of their operation modes. The POSIX-compliant

400 versions treat the \f(CW-b -n\fP combination as a separate

401 mode and thus implement it in a separate function. Only the early

402 System III implementation (and its 4.3BSD-UWisc variant) does

403 everything, apart from error handling, in the main function.

404 .PP

405 Implementations of cut typically have two limiting aspects:

406 One being the maximum number of fields that can be handled,

407 the other being the maximum line length. On System III, both

408 numbers are limited to 512. 4.3BSD-Reno and the BSDs of the

409 90s have fixed limits as well (\f(CW_BSD_LINE_MAX\fP or

410 \f(CW_POSIX2_LINE_MAX\fP). Modern FreeBSD, modern NetBSD, all GNU

411 implementations, and the Heirloom cut are able to handle

412 arbitrary numbers of fields and line lengths \(en the memory

413 is allocated dynamically. OpenBSD cut is a hybrid: It has a fixed

414 maximum number of fields, but allows arbitrary line lengths.

415 The limited number of fields does not, however, appear to be

416 any practical problem, because \f(CW_POSIX2_LINE_MAX\fP is

417 guaranteed to be at least 2048 and is thus probably large enough.

418

419 .SH

420 Descriptions

421 .LP

422 Interesting, as well, is a comparison of the short descriptions

423 of cut, as can be found in the headlines of the man

424 pages or at the beginning of the source code files.

425 The following list is roughly grouped by origin:

426 .TS

427 center;

428 l l.

429 CB UNIX	cut out selected fields of each line of a file

430 System III	cut out selected fields of each line of a file

431 System III \(dg	cut and paste columns of a table (projection of a relation)

432 System V	cut out selected fields of each line of a file

433 HP-UX	cut out (extract) selected fields of each line of a file

434 .sp .3

435 4.3BSD-UWisc \(dg	cut and paste columns of a table (projection of a relation)

436 4.3BSD-Reno	select portions of each line of a file

437 NetBSD	select portions of each line of a file

438 OpenBSD 4.6	select portions of each line of a file

439 FreeBSD 1.0	select portions of each line of a file

440 FreeBSD 10.0	cut out selected portions of each line of a file

441 SunOS 4.1.3	remove selected fields from each line of a file

442 SunOS 5.5.1	cut out selected fields of each line of a file

443 .sp .3

444 Heirloom Tools	cut out selected fields of each line of a file

445 Heirloom Tools \(dg	cut out fields of lines of files

446 .sp .3

447 GNU coreutils	remove sections from each line of files

448 .sp .3

449 Minix	select out columns of a file

450 .sp .3

451 Version 8 Unix	rearrange columns of data

452 ``Unix Reader''	rearrange columns of text

453 .sp .3

454 POSIX	cut out selected fields of each line of a file

455 .TE

456 .LP

457 (The descriptions that are marked with `\(dg' were taken from

458 source code files. The POSIX entry contains the description

459 used in the standard. The ``Unix Reader'' is a retrospective

460 document by Doug McIlroy, which lists the availability of

461 tools in the Research Unix versions

462 .[[ http://doc.cat-v.org/unix/unix-reader/contents.pdf .

463 Its description should actually match the one in Version 8

464 Unix. The change could be a transfer mistake or a correction.

465 All other descriptions originate from the various man pages.)

466 .PP

467 Over time, the POSIX description was often adopted or it

468 served as inspiration. One such example is FreeBSD

469 .[[ https://svnweb.freebsd.org/base?view=revision&revision=167101 .

470 .PP

471 It is noteworthy that the GNU coreutils in all versions

472 describe the performed action as a removal of parts of the

473 input, although the user clearly selects the parts that then

474 constitute the output. Probably the words ``cut out'' are too

475 misleading. HP-UX tried to be clearer.

476 .PP

477 Different terms are also used for the part being

478 selected. Some talk about fields (POSIX), some talk

479 about portions (BSD) and some call them columns (Research

480 Unix).

481 .PP

482 The seemingly least adequate description, the one of Version

483 8 Unix (``rearrange columns of data''), can be explained by the

484 fact that the man page covers both cut and paste, and in

485 their combination, columns can be rearranged. The use of

486 ``data'' instead of ``text'' might be a lapse, which McIlroy

487 corrected in his Unix Reader ... but on the other hand, on

488 Unix, the two words are mostly synonymous, because all data

489 is text.

490

491

492 .SH

493 References

494 .LP

495 .nf

496 ._r

497