Spec¶
Box¶
In this program, box values are always expressed as (left, top, right, bottom), in a coordinate system in which rotation is already applied, the y-axis points downward, the MediaBox's top-left corner is moved to (0, 0), and floats are clipped to integers.
User-created boxes must also be integers.
In addition, they must be unique within a page. Duplicate boxes (identical boxes in the same page) are not allowed.
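As a rough illustration of this convention, the sketch below converts a rectangle given in native PDF coordinates (origin at bottom-left, y pointing up) into a (left, top, right, bottom) box. The function name `pdf_rect_to_box` is hypothetical, rotation is ignored, and `int()` truncation is only an assumption about what "clipped to integers" means; the program's actual conversion may differ.

```python
def pdf_rect_to_box(rect, mediabox):
    """Convert a PDF-native rectangle to a y-descending integer box.

    rect and mediabox are (x0, y0, x1, y1) in PDF user space
    (y grows upward). Rotation is NOT handled in this sketch.
    """
    x0, y0, x1, y1 = rect
    mx0, my0, mx1, my1 = mediabox
    left = x0 - mx0        # shift so the MediaBox left edge becomes 0
    right = x1 - mx0
    top = my1 - y1         # flip y: distance from the MediaBox top edge
    bottom = my1 - y0
    # Assumption: "clipped" is modeled here as truncation by int().
    return tuple(int(v) for v in (left, top, right, bottom))


print(pdf_rect_to_box((50, 100, 500, 800), (0, 0, 595, 842)))
```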
Config¶
If there is an environment variable 'PDFSLASH_DIR' and it is a directory, or '$XDG_CONFIG_HOME/pdfslash' or '~/.config/pdfslash' is a directory, the program registers it as user-directory.
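The lookup can be sketched as follows. This is a minimal sketch, not the program's code: the function name `find_user_dir` is hypothetical, and the precedence (environment variable first, then XDG, then '~/.config') is an assumption taken from the order in which the candidates are listed.

```python
import os
from pathlib import Path


def find_user_dir(environ=None):
    """Return the first candidate that is an existing directory, else None."""
    environ = os.environ if environ is None else environ
    candidates = []
    if environ.get('PDFSLASH_DIR'):
        candidates.append(Path(environ['PDFSLASH_DIR']))
    if environ.get('XDG_CONFIG_HOME'):
        candidates.append(Path(environ['XDG_CONFIG_HOME']) / 'pdfslash')
    candidates.append(Path(os.path.expanduser('~')) / '.config' / 'pdfslash')
    for path in candidates:
        if path.is_dir():
            return path
    return None
```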
If there is a user-directory, and the system can use the Python standard library readline, the program uses a file '.history' (automatically created) for the readline history file (and '.python_history' for the Python command).
If there is a file 'pdfslash.ini' in the directory, the program reads it and updates the configuration.
The defaults are:
[main]
# The ratio of computer display pixel, to PDF pixel.
device_pixel_ratio = 1.0
# Gui window position to the display margin (x and y).
# 0.0, 0.0: top-left aligned
# 0.5, 0.5: center
# 1.0, 1.0: bottom-right aligned
winpos = 0.5, 0.5
# Max pages to sample, to create a merge image (one group) in GUI.
# So when running 'preview 1-600',
# the program is actually showing
# only this number of arbitrarily selected pages.
# '15' is briss' default.
max_merge_pages = 15
# Merge method to use in page-image-merging,
# either 'briss' (default) or 'simple'.
# 'simple' is a bit faster,
# and *may* work in some cases where 'briss' doesn't.
merge = briss
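A file in this format can be read with the standard library configparser. The sketch below is only an illustration of how the defaults and a user file might be combined; `load_config` and `DEFAULTS` are hypothetical names, and the program's real loading code is not shown in this document.

```python
import configparser

# Defaults copied from the listing above (values are strings, as in an ini file).
DEFAULTS = {
    'device_pixel_ratio': '1.0',
    'winpos': '0.5, 0.5',
    'max_merge_pages': '15',
    'merge': 'briss',
}


def load_config(text=''):
    """Parse ini text and return typed values, falling back to DEFAULTS."""
    parser = configparser.ConfigParser()
    parser.read_dict({'main': DEFAULTS})
    parser.read_string(text)          # user values override the defaults
    main = parser['main']
    return {
        'device_pixel_ratio': main.getfloat('device_pixel_ratio'),
        'winpos': tuple(float(n) for n in main['winpos'].split(',')),
        'max_merge_pages': main.getint('max_merge_pages'),
        'merge': main['merge'],
    }
```

For example, `load_config('[main]\nmerge = simple\n')` keeps every default except 'merge'.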
Commandline¶
- -h, --help¶
show this help message and exit
- PDFFILE¶
PDF filename to process
- --command COMMAND, -c COMMAND¶
run initial commands before showing prompt (split multiple commands with ';').
- --cmdfile CMDFILE, -f CMDFILE¶
run initial commands before showing prompt (reading from a file, one command a line).
- --nocheck, -n¶
do not perform initial commands verification. Otherwise the program aborts when a line starts with '# hash: ', and the value is different from input PDF file. This is for 'export' command.
- --_time, -_t¶
[DEBUG] print time for some processes
- --_save, -_s¶
[DEBUG] save merged image used for GUI in current directory
- --_nobanner, -_b¶
[DEBUG] suppress interpreter banner (intro)
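For orientation, the non-debug options above could be declared with argparse roughly as below. This is a sketch under assumptions: `build_parser` and `initial_commands` are hypothetical names, the debug options are left out, and the way the program actually splits '--command' may differ in detail.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='pdfslash')
    parser.add_argument('pdffile', metavar='PDFFILE',
                        help='PDF filename to process')
    parser.add_argument('--command', '-c', metavar='COMMAND',
                        help="initial commands, separated by ';'")
    parser.add_argument('--cmdfile', '-f', metavar='CMDFILE',
                        help='file with initial commands, one per line')
    parser.add_argument('--nocheck', '-n', action='store_true',
                        help='skip the hash verification of initial commands')
    return parser


def initial_commands(args):
    """Split the --command string on ';' into a list of commands."""
    if not args.command:
        return []
    return [c.strip() for c in args.command.split(';') if c.strip()]
```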
Interpreter¶
Token separator is space, so any command or argument must not have spaces.
When the command string starts with '#', it is ignored.
When the command string starts with Python regex '\[[a-z]+\] ', the matched part is stripped (e.g. '[gui] crop 1 10,10,400,500' -> 'crop 1 10,10,400,500').
Page number syntax is as follows.
1-5 1 to 5 inclusive (1,2,3,4,5)
3- 3 to last page
-10 1 to 10
1^5 every other page in 1 to 5 inclusive (1,3,5)
2^6 every other page in 2 to 6 inclusive (2,4,6)
(the two operands must both be odd or both be even)
: all pages
~ the same pages as the previous command
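The page number syntax above can be expanded into explicit page lists with a small parser. The sketch below is an illustration only: `parse_pages`, `last_page` and `previous` are hypothetical names, and accepting comma-separated parts (as in 'crop 2,3 ...' elsewhere in this document) is an assumption about how parts combine.

```python
def parse_pages(spec, last_page, previous=None):
    """Expand a page number string into a list of 1-based page numbers."""
    if spec == ':':
        return list(range(1, last_page + 1))
    if spec == '~':
        if previous is None:
            raise ValueError('no previous page numbers')
        return list(previous)
    pages = []
    for part in spec.split(','):
        if '^' in part:
            start, end = (int(n) for n in part.split('^'))
            if (end - start) % 2:
                raise ValueError('operands must be both odd or both even')
            pages.extend(range(start, end + 1, 2))
        elif '-' in part:
            start, _, end = part.partition('-')
            start = int(start) if start else 1        # '-10' means 1 to 10
            end = int(end) if end else last_page      # '3-' means 3 to last
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages
```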
Box syntax is as follows.
10,20,30,40 left,top,right,bottom
(they must be integers, no dots).
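The line rules at the start of this section (comments, the '[gui] ' style prefix, space-separated tokens) and the box syntax can be sketched together as follows. The names `preprocess` and `parse_box` are hypothetical, and the error handling is an assumption.

```python
import re

PREFIX = re.compile(r'\[[a-z]+\] ')


def preprocess(line):
    """Return the token list for a command line, or None for a comment."""
    if line.startswith('#'):
        return None
    match = PREFIX.match(line)        # only a prefix at the very start counts
    if match:
        line = line[match.end():]
    return line.split()


def parse_box(text):
    """Parse 'left,top,right,bottom'; int() rejects values with dots."""
    values = tuple(int(n) for n in text.split(','))
    if len(values) != 4:
        raise ValueError('a box needs exactly four integers')
    return values
```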
For 'modify' command, the following special syntax can be used.
For box1 (box to modify):
@ Specify a box by its order.
E.g. '@1' means the first box of each page.
For box2 (box to modify *to*):
+- Apply increment or decrement
to the chosen boxes (by box1) of each page.
E.g. when box is '-3,-3,+3,+3':
20,20,400,400 -> 17,17,403,403
30,30,600,600 -> 27,27,603,603
min, max min or max numbers
of the chosen boxes (by box1) of each page.
E.g. min,min,max,+0
(select the broadest rectangle
for left, top and right,
but do not change the bottoms.)
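The box2 syntax can be read as one rule per coordinate. In the sketch below, the hypothetical function `apply_box2` receives the boxes already chosen by box1 (one per page) and returns the modified boxes; treating an unsigned number as an absolute value is an assumption.

```python
def apply_box2(boxes, box2):
    """Apply a box2 string such as '-3,-3,+3,+3' or 'min,min,max,+0'.

    boxes: list of (left, top, right, bottom), one chosen box per page.
    """
    tokens = box2.split(',')
    if len(tokens) != 4:
        raise ValueError('box2 needs exactly four fields')
    result = []
    for box in boxes:
        new = []
        for i, token in enumerate(tokens):
            if token == 'min':
                new.append(min(b[i] for b in boxes))
            elif token == 'max':
                new.append(max(b[i] for b in boxes))
            elif token[0] in '+-':
                new.append(box[i] + int(token))   # relative change
            else:
                new.append(int(token))            # assumed: absolute value
        result.append(tuple(new))
    return result


boxes = [(20, 20, 400, 400), (30, 30, 600, 600)]
print(apply_box2(boxes, '-3,-3,+3,+3'))
print(apply_box2(boxes, 'min,min,max,+0'))
```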
Commands¶
Commands are case sensitive (e.g. Set and Python start with capital letters).
When commands take optional page numbers and they are omitted, selected pages are used.
Admittedly, select, unselect, fix and unfix tend to get very confusing. But normally you don't have to think about them until you need them.
Interpreter and GUI are using the same undo and redo stack data.
So in the interpreter, you can go all the way back to the initial state, through any changes made in the GUI. But in the GUI, undo is bound to the GUI invocation: you can't go back past the changes made in the current GUI.
select
Take one argument, page numbers.
Select page numbers.
Operations are done to only selected pages. Initially all pages are selected.
Use when you don't want to repeat very complex page numbers.
unselect : # unselect all pages
select 2-8 # select pages 2-8
crop 1-10 100,100,400,400 # crop pages 2-8
write # write pages 2-8
unselect
Take one argument, page numbers.
Unselect page numbers.
See select.
fix
Take one argument, page numbers.
Fix page numbers.
Box operations are not done to fixed pages. Initially all pages are unfixed.
Use when you want to make some pages 'done'.
crop 2,3 150,150,450,450 # crop pages 2,3
fix 2,3 # fix pages 2,3
crop 2-6 100,100,400,400 # crop pages 4,5,6
write 2-10 # write pages 2-10
unfix
Take one argument, page numbers.
Unfix page numbers.
See fix.
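One way to read the select and fix rules together: a command acts on the requested pages that are also selected, and box operations additionally skip fixed pages. The sketch below follows the two examples above; `target_pages` is a hypothetical name and the program's internal logic may be organized differently.

```python
def target_pages(requested, selected, fixed, box_operation=True):
    """Return the pages a command actually acts on.

    requested: iterable of page numbers, or None when omitted.
    selected, fixed: sets of page numbers.
    box_operation: True for commands such as crop, False for e.g. write.
    """
    if requested is None:
        requested = sorted(selected)    # omitted: selected pages are used
    pages = [p for p in requested if p in selected]
    if box_operation:
        pages = [p for p in pages if p not in fixed]
    return pages
```

With pages 2-8 selected, `target_pages(range(1, 11), set(range(2, 9)), set())` gives pages 2 to 8, matching the 'crop 1-10' line in the select example.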
append
Take two arguments, page numbers and box.
Append box.
(Add box to specified pages, keeping previously added boxes.)
overwrite
Take two arguments, page numbers and box.
Replace box.
(Add box to specified pages, removing previously added boxes.)
modify
Take three arguments, page numbers, box1 and box2.
Modify box.
(For each page, change the pre-existing box (box1) to the new box (box2). If box1 doesn't exist in any page, it is an error.)
discard
Take two arguments, page numbers and box.
Delete box.
(Find the box in each specified page, and remove it. If the box doesn't exist in any page, it is an error.)
clear
Take one argument, page numbers.
Clear boxes.
(Delete all added boxes in specified pages. That is, the pages revert to the original source cropboxes.)
auto
Take one argument, page numbers (optional).
Auto detect page margins and apply (overwrite) them. All previously added boxes are removed.
If the number of previous boxes is one, the detection is done against this box; otherwise (the number is zero, or two or more), the detection is done against the source cropbox.
preview
Take one argument, page numbers (optional).
Run tkinter GUI.
Options (optional):
-m, --mediabox: Group pages by source mediabox.
-c, --cropbox: Group pages first by source mediabox, and then by source cropbox (for each mediabox group). This is the default.
-s, --single: Group each page in each group, to navigate pages one by one.
-_q, --_quit: Create GUI window and immediately quit (for test).
write
Take one argument, page numbers (optional).
Create new PDF file with specified (or selected) pages.
It uses PyMuPDF's fitz.Document.save method, with the same arguments as fitz.Document.ez_save, except 'garbage=2' (instead of '3').
Options (optional):
-m, --more: Shortcut for -a{'garbage':3}. For shorter PDFs, it seems OK. It may make the file size smaller, but it tends to get much slower.
-a, --args: Update the default arguments. The string after, say, -a must be valid Python code, evaluating to a dictionary, with no spaces.
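How the two options combine can be sketched as a dictionary update. Everything here is illustrative: `build_save_args` is a hypothetical name, the defaults shown ('garbage' and 'deflate') are an assumption about what ez_save changes (this depends on the PyMuPDF version), and ast.literal_eval is used as a safer stand-in for evaluating the '-a' string.

```python
import ast

# Assumed defaults: ez_save-like arguments with garbage lowered to 2.
DEFAULT_SAVE_ARGS = {'garbage': 2, 'deflate': True}


def build_save_args(args_string=None, more=False):
    """Return keyword arguments to pass to the save call."""
    kwargs = dict(DEFAULT_SAVE_ARGS)
    if more:                                  # -m: shortcut for garbage=3
        kwargs.update({'garbage': 3})
    if args_string:                           # -a: e.g. "{'garbage':4}"
        update = ast.literal_eval(args_string)
        if not isinstance(update, dict):
            raise ValueError('the -a string must evaluate to a dictionary')
        kwargs.update(update)
    return kwargs
```

The resulting dictionary would then be passed on as keyword arguments, e.g. `doc.save(filename, **kwargs)`.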
show
Take one argument, page numbers (optional).
Show current boxes for specified pages.
If selected or fixed, pages are shown with headers 's' and 'f' respectively.
info
Take one argument, page numbers (optional).
[PyMuPDF v1.18.7 or later is required]
Show some PDF information for specified pages.
Page Count and PageLabels
MediaBox, CropBox, BleedBox, TrimBox, ArtBox, Rotate and UserUnit
For boxes, values that are (almost) the same as those of the previous boxes are omitted.
PageLabels and UserUnit are omitted if they are not defined.
The values are those from when the PDF file was first loaded. User crop commands don't update them.
Options (optional):
-p, --pdf: Print raw PDF string values as is. In this case, page attribute inheritances are not followed (MediaBox, CropBox and Rotate).
undo
Take no argument.
Undo box operations.
redo
Take no argument.
Redo box operations.
Set
Take zero or two arguments, config option name and option value.
With no argument, show current config options.
With two arguments, set the config option.
Set winpos 0,0
Python
Take no argument.
Run Python interpreter, with two variables exposed: doc and pages (current Document and Document.pages object).
You are supposed to know the source code.
For now, you can use it only for reading (not writing); otherwise, it will terribly break undo and redo.
(But if you are careful, not using undo and redo, then you may be able to save the PDF file successfully.)
To exit this Python interpreter, run exit() or send EOF.
export
Take no argument.
Print all box edit history in chronological order.
Conceptually, if they are supplied as input again, the program should 'replay' the same edits.
(pdfslash) export | cat > log.txt
(pdfslash) exit
$ pdfslash -f log.txt some.pdf
It also prints the file hash (crc32, to be exact) as a comment, and 'replay' will fail if the hash is different from that of the current input PDF file ('some.pdf' in the example above).
You can disable this check with the commandline option --nocheck.
Options (optional):
-a, --auto: Write to a file automatically, with the form '<user directory>/exported/<PDF file name>.<timestamp>.txt'.
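The hash check can be pictured as below. This is a sketch under assumptions: the names `file_hash` and `verify_commands` are hypothetical, the hash is assumed to be taken over the raw file bytes, and the 8-digit lowercase hex format is a guess at how the value is written after '# hash: '.

```python
import zlib

HASH_PREFIX = '# hash: '


def file_hash(data):
    """crc32 of the file bytes, as 8 hex digits (assumed format)."""
    return format(zlib.crc32(data) & 0xffffffff, '08x')


def verify_commands(lines, pdf_data, nocheck=False):
    """Return False if a '# hash: ' line disagrees with the PDF's hash."""
    if nocheck:                     # corresponds to --nocheck
        return True
    expected = file_hash(pdf_data)
    for line in lines:
        if line.startswith(HASH_PREFIX):
            if line[len(HASH_PREFIX):].strip() != expected:
                return False
    return True
```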
free
Take no argument.
Free all image cache in the program. Use when the program is grabbing too much memory.
(The program caches almost all GUI image and intermediate numpy arrays).
_mediabox_reset
Take one or two arguments, page numbers (optional) and tolerance.
(advanced, and experimental)
Some PDFs have too many slightly different mediaboxes for this program to be useful (it is unable to group pages for preview).
One way to solve this is to choose some bigger mediabox, and align the others to it, while discarding the ones that are too different. Because of the program's design, it has to actually set a new MediaBox to pages. This is a very crude procedure.
PyMuPDF removes all other boxes, CropBox, BleedBox etc. (so it says).
The process basically loads a completely new PDF, and resets the whole program (undo etc.), without exiting interpreter. Something may be broken somewhere.
The output of 'info' command will change accordingly.
---
Without the optional argument, it just reports a candidate mediabox.
It calculates an expanded mediabox that includes as many pages as possible, excluding pages that are bigger or smaller than the given tolerance allows (a number of pixels).
Options (optional):
-s, --set: Actually set MediaBox to included pages, after reporting, using reported data. Do nothing to excluded pages.
Example:
(pdfslash) _mediabox_reset 20 # Report for all pages,
# tolerance: 20 pixel.
(pdfslash) _mediabox_reset 10-400 20 # Report for 10-400 pages.
(pdfslash) _mediabox_reset 20 --set # Set MediaBox
# for all included pages.
exit
Take no argument.
Exit the program.
crop
Alias for append.
quit
Alias for exit.
(EOF)
Alias for exit. Send actual EOF.
'|' (pipe)
If any command has a string '|', the output of the command is passed to the shell.
Intended for a few basic things. E.g.:
show 1-100 | grep 155
show 1-100 | cat > log.txt
(Currently, in the shell command string after '|', only '>', '>>' and '|' are considered as shell special tokens. All other special characters are quoted, so they may not work as expected.)
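The quoting rule can be sketched with shlex.quote from the standard library. The functions `split_pipe` and `quote_shell` are hypothetical names, and splitting the shell part on whitespace is an assumption; the program's own tokenizing may differ.

```python
import shlex

SPECIAL_TOKENS = {'>', '>>', '|'}


def split_pipe(line):
    """Split 'command | shell part' at the first '|'."""
    command, sep, shell = line.partition('|')
    return command.strip(), (shell.strip() if sep else None)


def quote_shell(shell):
    """Quote every token except the three special ones."""
    return ' '.join(
        token if token in SPECIAL_TOKENS else shlex.quote(token)
        for token in shell.split()
    )
```

Under this reading, 'cat > log.txt' passes through unchanged, while a token such as '$HOME' is quoted and so is not expanded by the shell.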
GUI¶
Info¶
The title bar and label show some information.
title bar example:
pdfslash: 1-13,21 (110%) [copy]
1-13,21: current page numbers (in current group and current view).
(110%): current image zoom (when 100%, it is omitted).
[copy]: string 'copy', shown only when copy is pending (after key c).
label example:
1/3 both 595x841, sel: 100,100,400,500 (300x400, 1.333)
1/3: current group number (1) and the number of groups (3).
both: current view (both, odds, or evens).
595x841: current source mediabox size (GUI canvas size). left and top are always zeros (0,0,595,841).
sel: active box (either string 'sel' or 'box').
100,100,400,500: active box coordinates.
300x400: active box size.
1.333: ratio of height / width of active box.
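The size and ratio in the label follow directly from the box coordinates. A minimal sketch (the function name `describe_box` is hypothetical, and three decimal places is inferred from the example):

```python
def describe_box(box):
    """Format a box like the label: coords (WIDTHxHEIGHT, height/width)."""
    left, top, right, bottom = box
    width = right - left
    height = bottom - top
    ratio = height / width
    return f'{left},{top},{right},{bottom} ({width}x{height}, {ratio:.3f})'


print(describe_box((100, 100, 400, 500)))
```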
Keyboard¶
# <Arrow> means Left, Right, Up or Down keys
mouse:
left click: start selection (top-left)
drag: expand selection
release: end selection (bottom-right)
keys:
H: print this help in terminal
q: quit
<Arrow>: move top-left point
Shift+<Arrow>: move bottom-right point
Control+<Arrow>: move rectangle
h, j, k, l: move rectangle (Left, Down, Up, Right)
Return: crop by present selection (append)
Shift+Return: crop by present selection (replace)
n: next image group
p: previous image group
v: cycle view (both, odds or evens)
V: cycle view (reverse direction)
s: toggle source cropbox visibility
a: cycle active rectangle
d: delete active rectangle
c: copy active rectangle
z: zoom in
Z: zoom out
u: undo (box operations)
r: redo (box operations)
(when copy is pending):
left click: paste copied rectangle
x: paste copied rectangle (the same coords)