ipxml/sgml/wml/html/newsml

This program is used to convert into, out of and between different tagged
format files such as XML or SGML or variants like NITF, NewsML, XHTML, WML or
HTML.

Generally it can be used to convert things like :
	- NewsML	<-> IPTC 7901 or ANPA 1312
	- NITF		<-> plain ascii
	- XML		<-> HTML
	- SGML		<-> NITF
	- SGML		<-> plain ascii
	- WML		<-> ascii
	- HTML/NITF tables -> inline markup for Quark, InDesign or other Editorial systems

Data can be extracted from the SGML tags or attributes and formatted into text
eg.
	- convert and/or replace the data within a tag
	- plain ascii files -> XML possibly using FipHdr fields to create tagged data

Definitions and Glossary
	tag	something between '<' and '>'	eg.	<BODY>
		usually ending with tagend.	eg.	<LOCATION>Hollywood</LOCATION>
	data	non-tag information		eg.	"Hollywood" in the above example
	attribute - sub field/data within a tag	eg.	<LOCATION ID="996" PLACE="Hollywood">
	NITF	News Industry Text Format as put together by IPTC and NAA.
	XML,HTML Much simplified sub-set of SGML for WWW. - see www.w3.org

It scans its input directory and each file is processed according to a
parameter file specified either as the default or as the DY: FipHdr field.

Two types of processing are possible
	- strip or modify tag, attributes and/or data
	- extract data or attribute-data and stuff it in a FipHdr field which can then be
	  used to replace the top of the file or used by a subsequent program.

There is also the question of where to send the output file as this, by default,
is put in spool/2go for IPWHEEL to distribute. So it needs a Destination(s) or
DU FipHdr field. This is added by one of :
	- If there is a DX FipHdr field in the input file, that is used.
	- If not, the keyword 'dest' in the parameter file is used.
	- If that is not specified either, it is sent to 'woops', the Intercept queue.
	- You may also specify it from the incoming data or attribute-data using the
	  'fiphdr' keyword. In this case the contents of DX, 'dest' or 'woops' will
	  be the default if there is no data.
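
The fallback order above can be pictured with a short Python sketch. This is
only an illustration of the rules as described, not the program's own code;
the function name and its arguments are made up for the example.

```python
def resolve_destination(extracted, dx_field, dest_keyword):
    """Pick a destination using the fallback order described above.

    extracted    - value pulled from the data by a 'fiphdr' keyword (may be empty)
    dx_field     - DX FipHdr field of the input file (may be empty)
    dest_keyword - 'dest' keyword of the parameter file (may be empty)
    """
    for candidate in (extracted, dx_field, dest_keyword):
        if candidate:
            return candidate
    # Nothing specified anywhere: fall back to the Intercept queue.
    return "woops"


if __name__ == "__main__":
    print(resolve_destination("", "outsgml", "logcopy"))
    print(resolve_destination("", "", ""))
```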

IPXML may be used to convert XML tables to plain formatted text or in-line
markup such as Quark.

The parameter file in tables/sgml defaults to SGML and has the keywords:
	tag:(sgml tag name)	(optional subkeywords)
		Process a Start or End tag as follows :
		start:(FipSeq)
			optional string to replace the tag
		end:(FipSeq)
			optional string to replace the end tag ie. </location>
		strip:(tag|attribute|zap|everything|data|end|none)
			optional strip all or part of the tag and its associated data
			tag	All information between '<' and '>' is ignored.
				This will also zap the end tag if there is one.
			attribute all attributes are ignored; tag and data preserved.
			zap	All information - tag, attrib and data is zapped to
				the next tag.
			everything	Same as 'zap' but lower tags are always zapped too.
			data	All data for this tag is ignored; tag and attrib preserved
			end	Zap everything, including all other tags until and including
				the end tag : </NAME> unless any other tags are specified as
				NOT being stripped.
			none	Preserve everything (default)
		keepattribute: (optional FipSeq)
			Used during strip to keep all the attribute data. Any
			data after the keyword is added before and after each attribute :
				tag:ds	start:** end:-- strip:tag keepattribute:=
				<ds num="1.5" ver="orig">oinky</ds>
			gives	**=1.5==orig=oinky--
			As the optional data will be checked against the mapping tables,
			please make sure they are what you want them to be.
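
The 'ds' example can be reproduced with a small Python sketch. It only mimics
the documented result for one simple element with double-quoted attributes;
the function name and the regular expressions are assumptions made for the
illustration and say nothing about how the program parses its input.

```python
import re


def keep_attributes(element, start, end, sep):
    """Mimic 'strip:tag' plus 'keepattribute:' on one simple element."""
    match = re.fullmatch(
        r'<(\w+)((?:\s+[\w-]+="[^"]*")*)\s*>(.*?)</\1>', element, re.S
    )
    if match is None:
        raise ValueError("not a simple element")
    _name, attrs, data = match.groups()
    # Keep only the attribute values, each wrapped in the separator.
    values = re.findall(r'[\w-]+="([^"]*)"', attrs)
    kept = "".join(sep + value + sep for value in values)
    return start + kept + data + end


if __name__ == "__main__":
    print(keep_attributes('<ds num="1.5" ver="orig">oinky</ds>', "**", "--", "="))
```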
		endkeepattribute: (optional FipSeq)
			Same as keepattribute: (above) except the data is ONLY added after the
			attribute and before data.
		att: (attribute name)
			used with keepattribute: - use when only one attribute is required
			tag:content	strip:tag	att:content-role start:[fip- keepattribute:| endkeepattribute:-fip]
			<content content-ref="c00000002" content-role="urn:x-hoho:content-role:INTRO" auto-generated="false">
			generates : [fip-urn:x-hoho:content-role:INTRO-fip]
			which can then be mangled by ipxchg or other at a later stage

		upper:	force the field uppercase
		lower:	force the field lowercase
			Note that these two conversions only change data up to the next
			tag or end tag (ignoring <P>) which may not be the end of this tag.
		list-fiphdr:P3 If converting OrderedLists <ol> or unordereds <ul>, this is
			the FipHdr field containing the item number.
		tag:ul		  strip:tag	   start:<FipUL>	list-fiphdr:P6
		tag:ol		  strip:tag	   start:<FipOL>   list-fiphdr:P6
		tag:li		  strip:tag	   start:"\n	\P6"
			The actual string used in the Unordered list can be changed from a '*' using
			the parameter 'unordered-list-chr:+'

		fiphdridx: use a link-FipHdr (see below) to extract some FipHdr data referenced by
			tag:A		strip:tag	end:(\R7) fiphdridx:a@href=R7

		Note when specifying the tag, do NOT specify either the presy/endy ie the '<' or '>'.
		eg	tag:location	start:[ModeBold]	end:[ql]\n	strip:tag
		There is a special case for a comment <!-- This is a comment -->, where
		the 'end' subkeyword specifies the end of the comment.



	fiphdr:(2-letter code)	(optional subkeywords)
		Either	tagdata:(name of tag)
				specify the tag name which contains the data required.
		Or	tagattrib:(name of tag),(name of attribute)
		Or	tagattribute:(name of tag),(name of attribute)
				specify the tag name and the attribute name which
				contains the data required.
		Or	data: (FipSeq)
				general data to add to a FipHdr field.
		Or	text:
				Stuff the first part of text into this hdr field
				This searches for the <TEXT> tag. If not found, the top of data
				is used.
				default length is 100 chrs unless you change
				with a 'max:1024' (see below)

		For any of the fiphdr-tag* options, subkeywords are 'dup', 'max',
		'upper', 'lower'

		continue: allow this fiphdr to continue and include lower level tags

		dup:(optional separator)
			Flag that this field may be duplicated. Duplicate fields are separated
			with a space unless a separator chr is also specified.
			For 'dup' to work correctly, each tag or attribute to be accessed is
			stuffed into one fiphdr line only.
			Each occurrence of the duplicated tag MUST follow sequentially with
			no other tags intervening
		incdup:
			A second method of handling duplicate tags or tag/attributes is to
			create a new FipHdr field by incrementing the second letter of the
			FipHdr name
			eg	fiphdr:J6	tag:DEST	incdup:
				the first FipHdr will be	'J6'
				the second			'J7'
				the third			'J8'
				etc
			So the idea is to start with 'J0' (zero) if under 10
			duplicates are possible or 'JA' if 26.
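
The naming scheme can be sketched in Python as below. It simply steps the
second character through the character set, which matches the J6/J7/J8
example; what the program does once it runs past '9' or 'Z' is not described
here, so the sketch makes no attempt to model that. The function name is
hypothetical.

```python
def incdup_names(first, count):
    """Return 'count' FipHdr names starting at 'first', stepping the 2nd chr."""
    prefix, start = first[0], first[1]
    return [prefix + chr(ord(start) + offset) for offset in range(count)]


if __name__ == "__main__":
    print(incdup_names("J6", 3))
    print(incdup_names("JA", 26)[-1])
```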
		maxdup: (max number of duplicates allowed for this field)
			default: no limit for 'dup', 26 for 'incdup'
			Use this to limit the number of entries in a duplicated field.

		max: (max number of chrs in this FipHdr field)
			limit the size of the data to a fixed amount
			max:25
			Note there is no default except the absolute maximum is 1023

		upper:	force the field uppercase
		lower:	force the field lowercase
			Normally these take the concept of lower and uppercase chrs
			from the LOCALE of the system you are running on. These can
			be supplemented by the 'locale' and 'extralocale:'
			keywords below.
		key: and key2:	Some XML variants reuse structures and it is the contents of an
			attribute which describes what the data really is.
			In NewsML for example there can be multiple TopicSets with the attribute 'Scheme'
			on the 'FormalName' tag which varies. Use 'key' to define which one.
			eg
			fiphdr:PP tag:FormalName dup: key:TopicSet/Topic/FormalName/Scheme="Internal MetaCodes"

			See below for more comments for use with multiple structures.
			You MUST specify at least the tag and attribute in the key.

			There can be up to two 'key's for each 'fiphdr' - see below
			for an example where 2 keys are necessary for NewsML Topics.

		index: (Tag@attribute)
			Create an internal FipHdr for use with this index for outputting with
			tag/fiphdridx above
			fiphdr:R7	tag:FormalName	dup:	key:FormalName@Scheme="Ticker" index:Topic@Duid

		For fiphdr/tagdata there is an additional keyword of 'attribute-is-data:'.
		This forces any information in attributes in any lower tags to be treated
		as data.

		As some FipHdr fields have distinct meanings - SN, DU, DP etc - please use
		2 letter codes starting N or Q.
			eg	fiphdr:NA	tagdata:itemid	dup:+
			get the data from each <ITEMID> field. If there is more than one,
			they are separated by a '+'.

		general examples
			fiphdr:PN	data:\SN	max:6
			fiphdr:HT	data:"This is the old HS =\HS="
			fiphdr:DI	tagdata:brodtext	max:200
Other keywords :
	start-text-tag: (tag)
		Tag signifying the beginning of text data for 1st line (etc) of text (\$1, \$t etc)
		The default is 'TEXT' but is often defined as 'BODY' :
			start-text-tag:BODY
		or for NITF, the body.content tag
			start-text-tag:body.content

	pinhdr:
	pindata:	The <P> Paragraph tag is handled separately from other tags as it is
		often 'neutral' and should not alter the current processing.
		Use these two keywords to define what to do with the start and end 'P' in
		either a FipHdr field or in the data part:
		pinhdr:		start:~	end:\s
		pindata:	start:\n	end:\n
			'start:' being the string output in place of a <P>
			'end:' being the string output in place of </P>
		Note that CR NL etc are not valid characters in the FipHdr - if you do need
		them use another unique chr and use 'ipxchg' to convert at a later stage.
		Defaults for pinhdr:	start:\s	end:\s
		Defaults for pindata:	start:\n	end:\n

	dest: (one or more Fip Destinations separated by space or '+')
		This can be overridden by the DX: FipHdr field. Note that all
		destinations MUST be in the tables/sys/USERS file. As per normal
		case is important, so ZAPME and zapme are 2 different destinations.
		eg.	dest:logcopy+outsgml.
	stripfiphdr:	do NOT copy the existing FipHdr of the input file onto the output.
			Normally the FipHdr is stuck on top.
	nofiphdr:	do NOT add a FipHdr to the output file. Any new FipHdr keywords are
			added without the tilde NL top and bottom.
	zapfiphdrfields: (List of FipHdr fields to zap)
		Delete all occurrences of the FipHdr fields specified. This is ONLY valid
		where the FipHdr from the input file is retained for the output.
		In this case it is normal to zap :
			zapfiphdrfields:XZ,XS,CX,DC,SZ,CQ,CP,XP
	addhdr-file: (fullpath/filename in FipSeq)		default: none
		Extra, optional FipHdr information held in an external file
	addhdr-script: (script in FipSeq)			default: none
		Extra, optional FipHdr information generated by an external program or script
		addhdr-script:/fip/local/find_iim.pl \EP/\EN > \E3
		Temporarily, 3 FipHdr fields are available for the script :
		\EP holds the input folder
		\EN holds the input filename
		\E3 holds the name of a TMP file to create that will be read for the list.
	extra-fiphdr: (FipSeq)					default: none
		Extra, optional FipHdr information - note this overrides the -h switch
	use-sx:
or	use-external-file:
		if there is an SX FipHdr field with a path to the data file, use that rather
		than the data in the input file.

	filename: (FipSeq)	New filename for the output file name.
	supercede:
or	overwrite:	Where 'filename' has been specified, if there is already a file
			with that name in the output queue, it is deleted first.
	script: (path and name)	Script to run AFTER processing.
			The output filename and path is added to the script before running.
			Care must be taken NOT to run a script on a file that
			normally is written to a spooled queue.
			For example, the default output queue is 'spool/2go' where
			program 'ipwheel' may have already processed the file (and
			possibly deleted it) before the script has had time to
			function. So it is normal to specify a holding queue, not
			used by any other program as 'outque:'
			The script must therefore delete the file after use OR
			delete them all in the nightly maintenance - 'zapfiplog'
			Note also that the script is called only once at the end of
			the file. Use split-script: to run on each split (if using splits).
	outque:		Output Queue for the output file.
			This defaults to the '-o' input switch which defaults to spool/2go.
			If the first chr is NOT a '/', it is assumed under spool.
			By default outque is used in preference to -o,
			UNLESS the -V switch is on, where -o is used over outque.
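
As a rough Python sketch of that precedence, assuming the spool root is
/fip/spool (as in the debugging note further down) and that queue names such
as '2go' are given relative to it. The function and argument names are
hypothetical.

```python
def resolve_outque(outque, o_switch="2go", v_flag=False, spool_root="/fip/spool"):
    """Choose the output queue: 'outque' wins unless -V forces the -o switch."""
    chosen = o_switch if (v_flag or not outque) else outque
    if chosen.startswith("/"):
        return chosen                      # absolute path, used as is
    return spool_root + "/" + chosen       # otherwise assumed under spool


if __name__ == "__main__":
    print(resolve_outque("holding"))
    print(resolve_outque("holding", o_switch="testfolder", v_flag=True))
```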
	doneque:	Done Queue for the raw input file.
			This defaults to the '-d' input switch which has no default.
			If the first chr is NOT a '/', it is assumed under spool.

	before: (FipSeq)	String to parse and add at the top of the file.
	after: (FipSeq)		String to parse and add at the end of the file.
	beffile: (Path/filename) Contents of a file in FipSeq to parse and add at the
				top of the file (after 'before')
	aftfile: (Path/filename) Contents of a file in FipSeq to parse and add at the
				bottom of the file (before 'after')
	number:octal|dec|hex	In FipSeq, make all escaped numbers Octal, Dec or Hex.
				default is octal
	log:	Custom log line for the Fip Item log in FipSeq
		default is name of the parameter file (DF) and filename (SN)
	archive: (Archive Name)	Archive all incoming raw data using this
		parameter file. The 'archive Name' can be FipSeq.
		This adds the file to the normal Fip archives in /fip/log/data
		It should be purged using 'ipmaint'.
		eg 	archive: \SU
		or	combie:QS	SU|NS,rawdata
			archive:\QS
		ie Use the contents of FipHdr SU, if not there, NS, if not there
		just use the word 'rawdata'.

	striptags: Strip all tags EXCEPT those specifically stated using the 'tag' keyword.

 	default-strip: (tag|attribute|zap|everything|data|end|none)
 		default strip all or part of the tag and its associated data
 		(see strip: above for descriptions)

	ignore-non-xml-data: If there is any text or data BEFORE the start of the XML
		document or any after the end of the last End Tag, it is stripped.
		Normally it is preserved and output.

	locale:(valid locale)
		Change the locale from the System Locale to this
		The locale MUST be valid !
			locale:dk
	extralocale: (2chr combinations)
		For changing uppercase to lower and vice versa, we can add to the
		normal locale by specifying a series of 2-letter pairs.
		In each pair the lowercase chr is 1st then the upper, followed by a separator or space.
		eg	extralocale:aA,bB,cC,dD,\212\232,\213\237
		Normal a-z/A-Z are handled by default : in the example above they are included
		only to give an idea of the syntax

	chr:(octal/dec/hex number):(FipSeq string)
	hdrchr:(octal/dec/hex number):(FipSeq string)
	txtchr:(octal/dec/hex number):(FipSeq string)
		Replace this character with the string - usually an Sgml escaped chr.
		USE THIS TO REPLACE SINGLE CHRS WITH SGML CHRS (ie opposite of 'sgmlchr:' below).
		This can be a printable chr or an escaped number. The number is
		octal/dec/hex depending on the preceding 'number' keyword (if any).
		eg	chr:\313:&pound;	chr:<:&lt;
			Note that the ';' is part of the string and NOT a comment as it does NOT
			start the line.
		hdrchr	works on new FipHdr fields only.
		txtchr	works on data and when data is taken from a FipHdr field and
			added to the data part of a tag.
		chr	works on both data and new FipHdr fields.


	eoln:	Convert Line Ends (ie CR and/or NLs) from the outbound feed.
		SGML should be terminated CR NL	:			eoln:\r\n
		for HTML (default) the EndOfLine is NL only :		eoln:\n
		for NO eoln, specify NO subparameter :			eoln:
		The subparameter can be any valid FipSeq.
		(SGML uses the term 'RE' (record end) for Carriage Return CR and
		'RS' (record start) for LineFeed NL.)
		Note that, unless using the 'preserve-multiple-eolns', you should map
		eoln to something unique like eoln:<mypara> as normally CR NLs are reduced to
		a single End Of Line.
	preserve-multiple-eolns:
		Normally multiple end-of-lines are stripped as they
		are meaningless in the XML world. Use this to preserve them!
	preserve-top-spaces:
		Do NOT strip all spaces and blank lines at the top of the output file.
	preserve-padding-spaces:
		Do NOT strip all spaces and blank lines at the beginning of each tag.
	strip-multiple-spaces:
		Strip all multiple spaces and blank lines inside each tag.
	allow-presy-in-tag: In XML/HTML etc, reserved chrs like '<' or '>' cannot appear inside
		the attribute data of a tag - they must be encoded like &lt; etc.
		Use this where there might be some non-conforming stuff. However the
		drawback here is that they MUST be inside dbl qtes ie <meta ds="helle<p>ooo">
	convert-CDATA-sections:
		convert-CDATA-sections:no	- no dont ! (default)
		convert-CDATA-sections:yes	- yes pls and zap the '<![CDATA[' and ']]>'
		convert-CDATA-sections:preserve	- yes pls and leave the '<![CDATA[' and ']]>'
		Normally a CDATA section like :
		<![CDATA[ Vongerful Vondafool C&oe;penh&areing;gen <99thisIsAnon-compliant XMLtag> ]]>
		is considered a single, raw string of XML/SGML data, so the tags and
		entities (like &lt;) inside it are not changed either. Use this parameter
		to convert them.
		Note that you should use this option CAREFULLY if any tag in the CDATA
		is the same as a tag in the main envelope. See below for more comments.

	sgmlhdrchr: (FipSeq string) : (FipSeq Chr or String)
	sgmltxtchr: (FipSeq string) : (FipSeq Chr or String)
	sgmlchr: (FipSeq string) : (FipSeq Chr or String)
		Translate Sgml escaped chr back into a single chr or a string.
		USE THIS TO REPLACE SGML CHRS WITH A CHR OR A STRING (ie opposite of 'chr:' above)
		Sgml escaped chrs always start with a '&' and end with a ';' : "&gt;", "&copyright;"
		Note that case of both parameters IS important  - These two are different :
			sgmlchr:Oring:<CapOring>
			sgmlchr:oring:<smallOring>
		This will take &XXXX; and translate it.
		eg.	sgmlchr:lt:<
			sgmlchr:oumlaut:\202
			sgmlchr:Utilde:{tildeU}
		sgmlhdrchr	works on new FipHdr fields only.
		sgmltxtchr	works on data and when data is taken from a FipHdr field and
				added to the data part of a tag.
		sgmlchr		works on both data and new FipHdr fields.
		NOTE that if the input is any NITF, XML or HTML feed and the output
		is just plain text, then you almost always need :
			sgmlchr:lt:<
			sgmlchr:gt:>
			sgmlchr:amp:&
			sgmlchr:apos:'
		BUT you will want to preserve them /leave them alone if the output is
		the same or another NITF, XML or HTML flavour.
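
The effect of a handful of 'sgmlchr:' lines can be pictured with the Python
sketch below. It is an illustration only: the table, the function name and the
single-pass regular expression are assumptions, and entities with no entry are
simply left alone here.

```python
import re

# One entry per 'sgmlchr:name:replacement' line; lookup is case-sensitive.
SGMLCHR = {
    "lt": "<",
    "gt": ">",
    "amp": "&",
    "Oring": "<CapOring>",
    "oring": "<smallOring>",
}


def apply_sgmlchr(text, table=SGMLCHR):
    """Replace &name; with its mapped string in a single pass."""
    def swap(match):
        # Unknown entities are passed through untouched.
        return table.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", swap, text)


if __name__ == "__main__":
    print(apply_sgmlchr("a &lt; b &amp;&amp; c &gt; d"))
```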

	unicodechr: (Decimal chr ) : (FipSeq Chr or String)
		For all unicode chrs which are >= 256 (x100), you can specify a map to a
		single chr or a string.
		The chr can also be specified as hex with a preceding 'x'
		Commonly used ones are :
			; trademark
			unicodechr:x2122:(tm)
			unicodechr:8194:\s
			unicodechr:8195:\s
			unicodechr:8201:\s
			unicodechr:8211:-
			unicodechr:8212:_
			unicodechr:8216:'
			unicodechr:8217:'
			unicodechr:8220:"
			unicodechr:8221:"
			unicodechr:8249:<<
			unicodechr:8250:>>
			; euro in a table
			unicodechr:8364:EUR
			; fractions 1/3 .. 1/5 .. 1/6 .. 1/8 ... 7/8
			unicodechr:x2153:\s1/3\s
			unicodechr:x2154:\s2/3\s
			unicodechr:x2155:\s1/5\s
			unicodechr:x2156:\s2/5\s
			unicodechr:x2157:\s3/5\s
			unicodechr:x2158:\s4/5\s
			unicodechr:x2159:\s1/6\s
			unicodechr:x215A:\s5/6\s
			unicodechr:x215B:\s1/8\s
			unicodechr:x215C:\s3/8\s
			unicodechr:x215D:\s5/8\s
			unicodechr:x215E:\s7/8\s
			; ByteOrder ?? x.feff d.65279 o.177377
			unicodechr:65279:\s
	convert-unmatched-unicodes: (FipSeq Chr)
		Single chr to represent a unicode chr which is NOT latin1 and NOT matched in 'unicodechr'
		default: '?'
		To pass-thru all unmatcheds, use : convert-unmatched-unicodes:passthru
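
How 'unicodechr' and 'convert-unmatched-unicodes' work together can be
sketched in Python as follows. The table holds a few of the entries listed
above; the function name and the way 'passthru' is handled are assumptions
made for the illustration.

```python
# A few 'unicodechr:code:string' entries from the list above.
UNICODECHR = {
    0x2122: "(tm)",
    8211: "-",
    8216: "'",
    8217: "'",
    8220: '"',
    8221: '"',
    8364: "EUR",
}


def map_unicode(text, table=UNICODECHR, unmatched="?"):
    """Map chrs >= 256 through the table; anything unmatched becomes 'unmatched'."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 256:
            out.append(ch)                 # latin1 range is left alone
        elif code in table:
            out.append(table[code])
        elif unmatched == "passthru":
            out.append(ch)
        else:
            out.append(unmatched)
    return "".join(out)


if __name__ == "__main__":
    print(map_unicode("\u201cFip\u201d \u20ac5 \u2122 \u4e2d"))
```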

	hdr-strip-between: start:(FipSeq Chr) end: (FipSeq Chr)
		Where the 1st 9 lines of text are used in FipSeq using \$1 etc,
		use this to replace any tags with a space.
		Normally the following would be used :
			hdr-strip-between:	start:<	end:>
		But if you have mapped the start/end tags to other chrs in 'ipxchg'
		(possibly to control the tags and replace later with 'txtchr')
		eg	; for lines used in FipSeq - like 'before' and 'after'
			hdr-strip-between:	start:\201	end:\202
			; for text lines - Convert back from 201 202 <>
			txtchr:\201:<
			txtchr:\202:>
	sgmlchr-file:(filename)
		Use this to pull in a standard XML Entity file such as found at
			http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
		See also the note on utf-8 below.
		Each line has an entry of :
			<!ENTITY hearts   "&#9829;">
				<!-- black heart suit = valentine, U+2665 ISOpub -->
	convert-all-other-entities: This flag will automatically convert all
		entities NOT covered by chr/hdrchr/txtchr.
		If the entity is a number	&#136;, it is converted to one or more bytes
		If the entity is a name		&euro;, it is converted to '(euro)'
	raw-data-type:ascii/utf8/utf16		  default: ascii for 8bit chrs
	case-sensitive-tags:YES/NO
		In SGML and variants - HTML, early variants of NITF - the tag names are
		case INsensitive - ie <BODY> is the same as <Body> == <body>
		Ignoring case is the default for 'ipxml'
		BUT XML tags nowadays are case-SENSITIVE.
		Our general view is that no sane person would run tags with the same name
		but with different case - but then we are not the experts !
		So if you need to, use 'case-sensitive-tags:yes' to turn this ON.
	** This must be specified at the TOP of the parameter file BEFORE any fiphdr: or tag: !!

	unordered-list-chr: (chr or string)
		This changes the actual string used in the Unordered list.
		Default is "*".
	replace-fiphdr-tilde: (FipSeq chr)
		If a Tilde is found in a fiphdr field, replace it with this chr - default 0376
	replace-fiphdr-eoln: (FipSeq chr)
		If an end-of-line (<p>, <br> CR, NL or CRNL) is found in a fiphdr field,
		replace with this chr - default is SPC
	alt-param-file:(text)	(param file name)
		alt-param-file:<AlertML		alertml.fip
	add-EQ:
		add the input folder as an extra FipHdr field EQ
	cont-chr: (FipSeq chr)			default: 021 (DC1)
		Single chr to be used internally for flagging Continuation
		FipHdrs (ie for fiphdr:AB	tag:hoho	continue:)
		Use this if a 021 (hex 11) chr is valid data.
	cont-zap-chr: (FipSeq chr)		default: 022 (DC2)
		Single chr to be used internally for flagging Continuation
		FipHdrs (ie for fiphdr:AB	tag:hoho	continue:)
		Use this if a 022 (hex 12) chr is valid data.
	max-total-fiphdr-size: (total)		default: 32k chrs
		Max size of all FipHdr fields
	max-single-fiphdr-size: (total)		default: 4000 chrs
		Max size of a single FipHdr field
		This overrides the -F input switch

	wrap-lines: (no of chrs) : (fipSeq)	default: no wrapping
		Wrap text lines (but NOT plain text tables if processing tables) to this line
		length and insert the string
		; wrap NONtabular stuff at 80
		wrap-lines:80:<fipWRAP>\n
	ignore-xml-in-wrap:no or (number)	default: no
		ignore any xml in the calculations for the linelength
		the number can be the amount to add for each xml tag - generally 0
		; dont add anything for XML
		ignore-xml-in-wrap:0
	abstract-size: (number)
	abstract-fiphdr: (2 letter FipHdr code)
	abstract-msg: (message in FipSeq)
	stop-after-abstract:yes/no
		Create an abstract/first part of text when the derived data equals or
		exceeds the abstract-size.
		If a file is smaller than the size, only a single, complete file is output.
		Default - no abstract at all.
		The fiphdr is used to flag if the file is the Abstract or the Main.
		The stop flag can be used NOT to continue with the Main file (default is both files).
		The optional msg is inserted at the bottom of the text of an abstracted file.
			abstract-msg:\n\n***Abstract finishes, pls view original for remaining text***\n
	log-level: (number)
		10 - default	
		20 - log all tables

Input parameters (all optional) are :
Either
	-1 : filename for a single shot		default: spooled
		often this flag is used with -S (newname)
		to create a file called (newname) in spool/formsave
		for the DataFormatting module
or	-i : spooled input queue to scan	default: spool/2sgml
or	-I : scan input queue and 		default: spooled
		stop after the last file has been processed.

	-o : output queue 			default: spool/2go
	-d : done queue for original raw data	default: none-input deleted
	-D : display tags			default: no
		use this ONLY when running '-1' single shot to
		display all tags, attributes and levels and their data.
		ie use to debug/tune.
	-F : default max size of a single FipHdr	default: 4000
	-h : optional extra FipHdr string to add 	default: none
	-l : log every new file pls		default: do NOT log
	-L : log every new file pls with times		default: do NOT log
	-Q : quiet flag - do NOT flag minor errors	default: do
	-S : save this file in the save area	default: spooled output
		with the following name
		eg : -S "#SN:\XK#PP:\PP"
		use this for DataFormats (same switch as ipformat)
	-t : scan time for the directory	default: 2 secs
	-T : different folder under /fip/tables	default: sgml
		This should only be used when upgrading and you need to run 2 ipsgmls
	-V : use the content of the -o input switch for the outque	default: use outque if it exists
	-w : file wait time for files arriving	default: none
		across a network (for NFS, make about 10 secs)
	-z : name of the default parameter file	default: tables/sgml/SGML
	-Z : force the parameter file to this	default: DY or -z name
	-v : print version no and exit

**************** Notes ***********************

**** For Debugging, you can manually run the program with the -1 Single shot
switch with the -D to display all the tags in an input file
	CMD>ipsgml -1testfile -D -zNewsML.fip -otestfolder | more
	This will create a file in /fip/spool/testfolder

**** Rarely will you want 'sgmlchr:' and 'chr:' in the same parameter file -
chr converts single chrs to sgml chrs and sgmlchr converts them back !

**** 'sgmlchrs' are done first BEFORE 'chrs' Then any Upper/Lower case
conversion ****

So if you have a 'before', 'after' string (or files) with embedded SGML tags
BUT still need to catch chrs '<' and '>' :
	1. in the 'before' string, chg all < to { and > to }
		eg  before:{!DOCTYPE abc.dtd}\n
	2. change < and > using txtchr
		eg  txtchr:<:&lt;
			txtchr:>:&gt;
	3. change { and } using txtchr
		eg  txtchr:{:<
			txtchr:}:>
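
The brace trick relies on each chr being mapped once, so that a '<' produced
from '{' is not then turned into '&lt;'. A minimal Python sketch of that idea,
assuming a single pass over the text (the table and function name are made up
for the illustration):

```python
# Equivalent of: txtchr:<:&lt;  txtchr:>:&gt;  txtchr:{:<  txtchr:}:>
TXTCHR = str.maketrans({"<": "&lt;", ">": "&gt;", "{": "<", "}": ">"})


def apply_txtchr(text):
    """Map every chr once; output of one mapping is never mapped again."""
    return text.translate(TXTCHR)


if __name__ == "__main__":
    print(apply_txtchr("{!DOCTYPE abc.dtd}\n1 < 2"))
```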


**** Extra FipHdr fields are available to use :
	Z1 is the size in bytes of the data part of the document (ignoring before,
	after, beffile and aftfile).
	Z2 is the size in bytes of the data of the document ie ignoring tags (and ignoring
	before, after, beffile and aftfile).
	If you are using Z1 and Z2 already, populate 2 other fields by :
		newZ1: 2 letter code replacing Z1
	eg	newZ1:VT
		will put the sizes in FipHdr fields VT and VU.

**** Extra System Variables are :
	\$1	first line of text
		...
	\$9	ninth line of text

**** NULs (characters of binary zero) are stripped from the output file.
	So a parameter like the following will have no effect at all !
		tag:ds	start:\000

**** Current Limitations are :
	No more than 2 million tags may be specified.

**** If there is NO FipHdr or the SN field (which should be the name of the
file) is missing, the original filename is used as the SN. Any hashes ('#') in
this created SN field are changed to hex.9d/oct.235/dec.157

**** Program change - from version 14+, please use 'preserve-multiple-eolns' to
keep ALL the end of lines of non-xml data.

**** CDATA fields
Note that an XML CDATA field is specified as a tag named '![CDATA' - ie without
the trailing '['.

**** Splitting files

For SGML/XML files that contain multiple 'things', there is a means of
splitting these either into discrete files or into a single file with a
Splitter string/tag and FipHdr pertaining to just that file.

	Eg	You might need to split off each ARTICLE from the following structure BUT still retaining the Page info
	<PAGE>
		some relevant page info
		<ARTICLE>
			some relevant article info
		</ARTICLE>
	</PAGE>


Where a single output file with one of many 'splits' is required, use the
following parameters :
	split-on-tag: (tag)
	split-on-endtag: (tag)
	split-on-tagattribute: (tag),(attribute)
		Create a split on this tag or tag/attribute
		The split is put BEFORE the start or AFTER the end tag depending
		on the option chosen.
	split-on-level: (number)
		While you can NOT specify trees for 'split-on-tag' (or tagatt), you may specify
		the level at which the split MUST take place. So that if you have multiple
		levels of embedded tags - like NewsML NewsComponents for example, use this to
		decide which level.
		eg : If you have NewsML/NewsEnvelope/NewsComponent/NewsComponent
			use split-on-level:4 to split ONLY on the 4th level, not the 3rd.
			use the -D input switch to show levels for a single file.
			This parameter has nothing to do with cooking.
	stop-on-tag: (tag)
	stop-on-endtag: (tag)
	stop-on-tagattribute: (tag),(attribute)
	stop-on-level: (number)
		ditto - but stop processing
	splitter-string: (FipSeq)
		Where a single output file is required, this is placed in the data to signal
		the start of a new bit; FipHdr follows.	default is "\n<FIP-SPLIT>"
		splitter-string:********** BRS DOCUMENT START *************
	new-file-on-split: (FipHdr field for Seqno)
		Instead of putting all the splits in one file with a <FIP-SPLIT> between
		this option creates a completely new file. The FipHdr specified will
		contain the sequence number of this file from 1.
		new-file-on-split:NZ
	split-on-no-data:
		Normally only if the previous element had data will it be ended and
		the next file started. Use this flag to force a split EVERY time
		the split criteria is met, ignoring if there was any data.
	split-script: (path and name)   Script to run AFTER processing this file
	table-width-fiphdr: (FipHdr field)
		This FipHdr will contain the maximum width of the table.
		eg 	table-width-fiphdr:AB
	table-width-minimum: (width)
		If 'table-width-fiphdr' is specified, make it a minimum of this. def. none
	strip-trailing-table-spaces:no/yes
		If there are any spaces at the end of a table row, delete them (default)

	NOTE that if you are running splits, then you PROBABLY want to keep the FipHdr.
	This is because there is often a chunk of metadata BEFORE the split which
	needs to be saved for EACH split - and it has probably been stuffed in the FipHdr.

**** Multiple specified Structures

NewsML TopicSets and other multiple specified structures
Considering a structure like :
	<TopicSet FormalName="Companies">
		<Topic Duid="T00001">
			<TopicType FormalName="Company"/>
			<FormalName Scheme="Listed Companies">PNOK.L</FormalName>
			<FormalName Scheme="Nasdaq codes">PNOOK</FormalName>
			<Description>Pocket Nook Corp</Description>
		</Topic>
		<Topic Duid="T00002">
			<TopicType FormalName="Company"/>
			<FormalName Scheme="Listed Companies">FIP.L</FormalName>
			<FormalName Scheme="Nasdaq codes">DRIVL</FormalName>
			<Description>Mega Fip Corp</Description>
		</Topic>
	</TopicSet>

; get the Listed Coys and use '+' as a separator
fiphdr:YC tag:TopicSet/Topic/FormalName dup:+ key:TopicSet/Topic/FormalName/Scheme="Listed Companies"
; get the Nasdaq codes and use '*' as a separator
fiphdr:YN tag:TopicSet/Topic/FormalName dup:* key:TopicSet/Topic/FormalName/Scheme="Nasdaq codes"
; use U1, U2 etc as holders of the descriptions
fiphdr:U1 tag:TopicSet/Topic/Description incdup:
would give new FipHdr fields of
	YC:PNOK.L+FIP.L
	YN:PNOOK*DRIVL
	U1:Pocket Nook Corp
	U2:Mega Fip Corp
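
The effect of 'dup:' and 'incdup:' on the structure above can be imitated
outside IPXML. The following Python is only a sketch of the behaviour described
here, not IPXML's own code: the helper names (tag_data, dup, incdup) are
invented for the illustration, and stepping the second character of the field
name for 'incdup:' is an assumption based on the U1, U2 example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<TopicSet FormalName="Companies">
  <Topic Duid="T00001">
    <TopicType FormalName="Company"/>
    <FormalName Scheme="Listed Companies">PNOK.L</FormalName>
    <FormalName Scheme="Nasdaq codes">PNOOK</FormalName>
    <Description>Pocket Nook Corp</Description>
  </Topic>
  <Topic Duid="T00002">
    <TopicType FormalName="Company"/>
    <FormalName Scheme="Listed Companies">FIP.L</FormalName>
    <FormalName Scheme="Nasdaq codes">DRIVL</FormalName>
    <Description>Mega Fip Corp</Description>
  </Topic>
</TopicSet>"""


def tag_data(root, tag_path, key=None):
    """Data of every element on tag_path, e.g. 'TopicSet/Topic/FormalName'.

    key is an optional (attribute, value) pair, imitating key:.../Scheme="...".
    """
    first, _, rest = tag_path.partition("/")
    if first != root.tag:
        return []
    elems = root.findall(rest) if rest else [root]
    return [(e.text or "").strip() for e in elems
            if key is None or e.get(key[0]) == key[1]]


def dup(values, separator):
    """Imitate dup:<sep> - all repeats in ONE field, joined by the separator."""
    return separator.join(values)


def incdup(field, values):
    """Imitate incdup: - one field per repeat, stepping the second character
    of the field name (U1, U2, ...). Assumption made for this sketch."""
    return {field[0] + chr(ord(field[1]) + i): v for i, v in enumerate(values)}


root = ET.fromstring(SAMPLE)
path = "TopicSet/Topic/FormalName"
fields = {
    "YC": dup(tag_data(root, path, ("Scheme", "Listed Companies")), "+"),
    "YN": dup(tag_data(root, path, ("Scheme", "Nasdaq codes")), "*"),
}
fields.update(incdup("U1", tag_data(root, "TopicSet/Topic/Description")))

for name, value in fields.items():
    print(f"{name}:{value}")
```

Run against the sample, this prints the same four fields as listed above.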

**** Interpreting Tables

IPXML may be used to convert XML tables to plain formatted text or in-line
markup such as Quark.

The two main, and exclusive, uses are
	1. format table rows into plain text rows where the columns line up.
	2. add inline markup dependent on the table and the row.
This inline markup can be anything - Quark Tags, CCI, Atex, progs.MediaSystem Justif,
progs.InDesign etc.
A note of caution - IPXML will format tables (and tables within tables) with up
to 108 (was 62 until version 19g3) columns per row. For any more, use the data
formatting package.

For Lining-up-columns, it spaces out all the columns to the maximum in the
table. If there is an 'align' attribute, then the data is aligned according to
that. Otherwise the first column is flush LEFT and the rest flush RIGHT.
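
As a rough illustration of the lining-up rule, the Python below pads every cell
to the widest cell in its column, with the first column flush left and the rest
flush right unless an alignment is given. It is a sketch only: the function
name line_up is invented here, and the alignment is passed per column for
simplicity, whereas IPXML reads it from the 'align' attribute in the data.

```python
def line_up(rows, aligns=None):
    """Pad each cell to the maximum width found in its column.

    rows   - list of rows, each a list of cell strings
    aligns - optional {column index: 'left' | 'right' | 'center'}
    """
    aligns = aligns or {}
    ncols = max(len(row) for row in rows)

    # Widest cell in each column across the whole table.
    widths = [0] * ncols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))

    lined = []
    for row in rows:
        padded = []
        for i in range(ncols):
            cell = row[i] if i < len(row) else ""  # short rows get empty cells
            how = aligns.get(i, "left" if i == 0 else "right")
            if how == "left":
                padded.append(cell.ljust(widths[i]))
            elif how == "center":
                padded.append(cell.center(widths[i]))
            else:
                padded.append(cell.rjust(widths[i]))
        lined.append(padded)
    return lined


table = [["Team", "P", "Pts"], ["Arsenal", "10", "24"], ["Spurs", "9", "7"]]
for cells in line_up(table):
    print("  ".join(cells))
```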

How does it work ?

Data for each row is held as progs.FipHdr fields (usually UA-UZ then U0-9 then
VA-VZ).

At the end of the row, it is output as a row using a progs.FipSeq line which defaults
to :
	(spc) \UA (spc) (spc) \UB (spc) (spc) ..... \r\n
for the number of columns in that table.
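
The field naming and the default row line can be pictured with a few lines of
Python. This is a sketch under stated assumptions, not IPXML's code: the
function names are invented, the naming stops at VZ because the names used for
columns beyond the 62nd are not described here, and no gap is added after the
last column.

```python
import string


def column_fields(ncols):
    """Field names for the first ncols columns: UA-UZ, then U0-U9, then VA-VZ."""
    names = (["U" + c for c in string.ascii_uppercase]
             + ["U" + d for d in string.digits]
             + ["V" + c for c in string.ascii_uppercase])
    if ncols > len(names):
        # Columns past VZ are outside what this sketch covers.
        raise ValueError("field names beyond VZ are not covered by this sketch")
    return names[:ncols]


def default_row_template(ncols, row_start=" ", column_gap="  ", row_end="\r\n"):
    """Build the default output line: (spc) \\UA (spc)(spc) \\UB ... \\r\\n."""
    refs = ["\\" + name for name in column_fields(ncols)]
    return row_start + column_gap.join(refs) + row_end


print(column_fields(4))
print(repr(default_row_template(3)))
```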

This output can be replaced by using either the 'default-class' parameter or
the 'class' attribute on a 'TABLE' tag.

So if there is a <TABLE class="soccer-score">, then a file in
tables/sgml/class/SOCCER-SCORE should contain one or more of the following
keywords :
	table-start:[font=HelveticaBold][pointsize=16]SOCCER SCORE[quad]\n
	table-end:[quad]Data Supplied by Fippies.[quad]\n
	table-row:[font=Helvetica][tab][bold]\UA[roman][tab]\UB[tab]\UD[quad]\n

The table-start is produced BEFORE the table and the table-end after it, while
each row has the table-row applied.
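
To make the order of events concrete, here is a small Python imitation of how
the three class keywords wrap a table. It is a sketch only: expand and
render_table are invented names, only the \XX field references and \n are
interpreted (FipSeq itself understands far more), and the row data is made up.

```python
import re

# The class file contents from the example above, held as a dictionary.
SOCCER_SCORE = {
    "table-start": r"[font=HelveticaBold][pointsize=16]SOCCER SCORE[quad]\n",
    "table-end": r"[quad]Data Supplied by Fippies.[quad]\n",
    "table-row": r"[font=Helvetica][tab][bold]\UA[roman][tab]\UB[tab]\UD[quad]\n",
}

# Matches \n or a two-character field reference such as \UA or \U0.
TOKEN = re.compile(r"\\(n|[A-Z][A-Z0-9])")


def expand(template, fields):
    """Replace \\XX with the value of field XX (empty if unset) and \\n with NL."""
    def replace(match):
        token = match.group(1)
        return "\n" if token == "n" else fields.get(token, "")
    return TOKEN.sub(replace, template)


def render_table(rows, cls):
    """table-start once, table-row for every row, table-end once."""
    out = [expand(cls["table-start"], {})]
    for row in rows:
        # Columns 1..26 only (UA-UZ); enough for this illustration.
        fields = {"U" + chr(ord("A") + i): cell for i, cell in enumerate(row)}
        out.append(expand(cls["table-row"], fields))
    out.append(expand(cls["table-end"], {}))
    return "".join(out)


rows = [["Leeds", "2", "(1)", "Hull"], ["Derby", "0", "(0)", "Stoke"]]
print(render_table(rows, SOCCER_SCORE), end="")
```

The third cell of each row never appears in the output because the table-row
line does not mention \UC.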

Note that in the above example we missed out the third field \UC - there is
nothing to stop you rearranging the fields or leaving some of the data out.

Also you may use the lovely progs.FipSeq 'partial', 'combie', 'unique' etc to play
around with the data.

If you do NOT specify a complete output line with table-row (or thead-row),
there are parameters for adjusting the look :
	column-gap: (progs.FipSeq string)
	row-start: (progs.FipSeq string)
	horiz-rule: (progs.FipSeq chr)
These allow you to specify the actual chrs that will start a table data line
and the gap between each column and the character or string to use if an <HR>
occurs in the table.
eg Start each line with a (hyphen) (space), make the gap 4 spaces and use
multiple '+' for horiz rules.
	column-gap:\s\s\s\s
	row-start:-\s
	horiz-rule:+
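
The effect of those three settings can be pictured with the Python below. It
is a sketch under assumptions, not IPXML's code: format_rows and decode are
invented names, only the \s, \n and \r escapes are decoded, a None entry stands
for an <HR> in the table, and the rule is assumed to start with the row-start
string and to run the width of the widest data line.

```python
def decode(seq):
    """Decode the few escapes used in this example (FipSeq has many more)."""
    return seq.replace("\\s", " ").replace("\\n", "\n").replace("\\r", "\r")


def format_rows(rows, params):
    """Join already-padded cells using row-start, column-gap and horiz-rule."""
    row_start = decode(params.get("row-start", "\\s"))    # default 1 space
    gap = decode(params.get("column-gap", "\\s\\s"))      # default 2 spaces
    rule = decode(params.get("horiz-rule", "-"))          # default '-'
    row_end = decode(params.get("row-end", "\\n"))        # default NL

    data_rows = [row for row in rows if row is not None]
    width = max((len(gap.join(row)) for row in data_rows), default=0)

    lines = []
    for row in rows:
        if row is None:
            body = (rule * width)[:width]   # <HR>: repeat the rule character
        else:
            body = gap.join(row)
        lines.append(row_start + body + row_end)
    return "".join(lines)


PARAMS = {"column-gap": r"\s\s\s\s", "row-start": r"-\s", "horiz-rule": "+"}
ROWS = [["Leeds", "2"], None, ["Hull ", "1"]]
print(format_rows(ROWS, PARAMS), end="")
```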

Keywords in the main parameter file
	format-tables:
		This is necessary to flag that the tables need formatting.
	default-class:(default-class)
		name of a file in tables/sgml/class holding Styles for outputting each row.
	line-up-columns:
		This flags that the data will be space padded to line-up the columns.
	column-gap: (progs.FipSeq string)		default is 2 spaces
	row-start: (progs.FipSeq string)		default is 1 space
	row-end: (progs.FipSeq string)		default is NL
	horiz-rule: (progs.FipSeq Chr)		default is '-'
	bullet: (progs.FipSeq Chr)			default is '*'
	newUA: 2 letter code replacing UA as the first column of a row.
		Both must be a letter and the first cannot be 'Z'.
		The second will always be 'A'.
	fiphdr-for-table: (progs.FipSeq string)	default: none
		Extra progs.FipHdr to add if there is a table in the data.
	split-tables-and-text: (progs.FipHdr)
		Add Marker in text Or create NEW file on tables/text transition.
	split-tables-into-files:
		Use this to split the incoming file into discrete files for tables and non-tables.
		The default is NO, which adds the <FipSplitTables> string instead.
		For files - a new file is created at the start and end of each table
		and the progs.FipHdr is used to hold the Sequence number of this take.
	fiphdr-for-text:  (progs.FipSeq string)	default: none
		Extra progs.FipHdr to add if this subfile is a text element.
		This is ONLY if the 'split-tables-and-text' is specified.
	use-pi-widths:yes
	pi-colwidths:IDNtableColWidth
		This expects a PI tag with colwidths eg
			<?FingerPost IDNtableColWidth="10 10 10 20" ?>
		Default no
	wrap-table-cells:no/yes/(number)
		This has 2 purposes -
		- with a number : optimum col width of a table (ie don't squeeze too much!)
		- or make this NO to automatically calculate the max width of each column and space out accordingly.
		Default is YES.
	max-col-width: (number)
		Force the colwidth to be a max of this number.	Default: no max.
		If there are 2 numbers, the 1st is for the 1st col and the 2nd is for the 2nd and subsequent cols
		ie make the first col a max of 30 chrs and all others 20
			max-col-width:30,20
	interpret-style:idx/bizwir/html
		Interpret some css attributes if there are any (currently just alignment)
		Parameter 'idx' states they are HTML Tidy styles which are numeric from 1-n
		Parameter 'html' looks for ordinary html 'left', 'right', 'center' (note it is only looking for a class name, NOT a style)
			<td class="bold32 spc33 right">
		Parameter 'bizwir' looks for progs.BusinessWire classes 'bwtextalign...'
			<td class="bwcellpaddingleft0  bwverticalalignbottom bwtextalignleft bwsinglebottomborder">
		or	<td class="bwpadl0 bwnowrap bwpadr0  bwvertalignb bwalignl bwsinglebottom">
	max-cols-per-row: (number up to and including 108)
		Default is a max of 108 columns per row (was 62 until version 19g3).
		Use this to set a limit of up to 108.
		The data in columns that exceed this is ignored.
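
Two of the keywords above, 'max-col-width' and 'interpret-style', lend
themselves to a short illustration. The Python below is a sketch only and not
IPXML's code: all function names are invented, and the bizwir helper handles
just the 'bwtextalign...' form, so the shorter 'bwalignl' form in the second
example is deliberately left unhandled.

```python
def parse_max_col_width(value):
    """'30,20' -> (30, 20); a single number applies to every column."""
    parts = [int(p) for p in value.split(",") if p.strip()]
    if not parts:
        return None                      # no maximum
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else first
    return first, rest


def cap_for_column(caps, index):
    """Maximum width for column index (0 = first column), or None."""
    if caps is None:
        return None
    return caps[0] if index == 0 else caps[1]


def html_alignment(class_attr):
    """interpret-style:html - a class NAMED left, right or center."""
    for name in class_attr.split():
        if name in ("left", "right", "center"):
            return name
    return None


def bizwir_alignment(class_attr):
    """interpret-style:bizwir - classes of the form bwtextalign<how>."""
    prefix = "bwtextalign"
    for name in class_attr.split():
        if name.startswith(prefix):
            return name[len(prefix):]
    return None


caps = parse_max_col_width("30,20")
print(cap_for_column(caps, 0), cap_for_column(caps, 3))
print(html_alignment("bold32 spc33 right"))
print(bizwir_alignment("bwcellpaddingleft0  bwverticalalignbottom "
                       "bwtextalignleft bwsinglebottomborder"))
```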

In the CLASS file
	table-start: (progs.FipSeq)
	table-end: (progs.FipSeq)
	table-row: (progs.FipSeq)
The same three keywords also exist for THEAD, TBODY and TFOOT. Eg:
	thead-start: (progs.FipSeq)
	tbody-row: (progs.FipSeq)

----------------------------------------------------------------------------

Version Control
;019g7	04jul11 redid endtags ;3 bugette with > 26 columns ;4 -V added ;5 woops
endtags/tables ;6 added filter
		;7 valgrind cleanups
;019f41 17may06 bugette in progs.EndTags when strip:none
	;a-c 21sep06 added new Xmlinternals progs.TagSpecial (b nasty bug in 19a) (c CDATA
quirk)
	;d 14aug07 tweak to trees
	;e1-22 20sep07 more on plain text tables - added Rowspan properly
	(;19 added addhdr-script ;20 added -T ; 21 bugette DX and 'dest' were swopped)
	;23 2apr08 bugged in wrap ;24 9may08 added log-split ;25-26WINNT + key
bugettes
	;e27-35 23jun08 for strip:everything and end tags and tables with no rows (35 utf8 bugette)
	;e36-37 22oct08 added -F and max-single-fiphdr-size
	;e38-40 27oct08 Bizwir - sup and inf added (plus start table strip bugette)
	;e41 15dec08 added split-on-endtag: ; 42 internal-tuning wrap buffersize
	;f1-3 01feb09 made Tag structure variable to cope with files > 2million tags
	;f5 27feb09 added stop-on-tag/att/endtag stop-on-level
	;f6-14 20mar09 added abstract-fiphdr and abstract-size ;9 bugette ;15
maxStyles->3000
		;16 rework levels in common ;17-24 bugette for progs.FipHdrs > 64k (commonxml too)
		;19 21oct09 minor check on abstract ;23-24 redid styles ;26-27 19jan10 added
embedded tables
		;28-29 22feb10 colspans with no col
		;30 21mar10 bizwir class names change
		;31-33 3jul10 bugette in very large spanned columns and in styles and added
-h and extra-fiphdr:
		;34 13aug10 wrinkle - table with ONLY colspans - and preserve utf8/16 if
output-data-type=raw-data-type
		;35-36 23aug10 added convert-unmatched-unicodes:pass-thru
 		;37-40 31jan11 added use-sx and Style for LSE fix/fast
 		;41 11may11 added default-strip:
;018z	21apr04	;a-b more tables2text cleanups
	;c woops split-script went missing
	;d-e 28apr04 added tag:XX strip:everything
	;f-h 01jun04 bugette in R-dualRics and Unlisteds
	;i 30jun04 bugette in tab2txt
	;j-k 02jul04 added maxdup:
	;l-p 14jul04 (protect isspace with 0200) plus tables-if no PI, use wrap
	;q-r 02sep04 added no-break for <> in wrap_cell
	;s 17sep04 speedy
	;t-u 04nov04 Rtrs-UL now in roychk
	;v-z 05oct05 strip:none was missing the end tags
;017z	29may03 added eoln-in-fiphdr plus alt-param-file
	;b-d 05jun03 2 bugettes - minor
	;e 10jun03 no-data: added to fiphdr;
	;f 20jun03 table Priority
	;g 30jun03 PI-FingerPost progs.IDNColWidth added
	;h 17jul03 make list-fiphdr visible (ie leave in FipHdr, not zap)
	;i 21aug03 bugette in specials
	;j-m 31oct03 timings and very big progs.FipHdrs
	;n 26nov03 added parseable doneque
	;o-q 12jan04 bugette in wrap cells and slim_fiphdr/max-fiphdr-size
	;r-s 04mar04 added progs.FipHdr EQ (input queue) on 'add-EQ'
		and bugettes in special RORIGIN2 and added cont-chr/cont-zap-chr
	;t-u 13mar04 allow lead/trailing spaces in continuation tags
	;w-z 06apr04 zap any sundry tags inside a table - for now.
;016z	23jun02 preceding and trailing blank lines can be tables when splitting.....
	;a 10jul02 bugette in splitting tables
	;b 18sep02 added -D and -S for ipformat compatibility
	;c/d/e 17oct02 added progs.TableSplit string
	;f 31oct02 added single quotes too
	;g/h 25nov02 cleanup tables and ignore # in linkhdr for reference
	;i 04dec02 added style-CharWidth for tables and Lists
	;i/j/k/l 12dec02 BUG with large files and allow continuations
	;m-w 19dec02 added row-end plus bugette in get_duid/links
	;x 25apr03 added -P
	;y-z 14may03 remove trailing spaces from a table line and replaceTilde
;015z	10oct01 added table processing and added convert-all-other-entities:
	last end-tag was NOT being handled correctly
	;c 15nov01 bugette with last tag if PRE
	;d 19nov01 added preserve-padding-spaces
	;e 21nov01 added sgmlchr-file
	;f/g 22nov01 added convert-to-utf-8 and bugette - spaces before/after attrib
values
	;h 03dec01 bugette with duplicate fiphdr fields
	;i 11dec01 cleanedup splits and added split-on-no-data
	;j 28dec01 tables cleanup
	;k 10jan02 added split-script and more on tables
	;l 17jan02 more on tables plus endtags not correct on continuations
		plus handling DOCTYPE attributes better
		plus handling Comments redone
	;m 22jan02 allow trees for fiphdr:AA tag:a/b as well as tagatt
	;n/o/p 28jan02 added 'fiphdr-for-table' and 'split-on-level'
	;r/s 09mar02 added -I
	;t/u 16apr02 order of ending file is now Check Dups, then Mandatorys then
Standing
	;v/w/x 22apr02 bugette in line-up-cols with keepattributes and allowPresyInTag
	;y/z 27may02 added 2nd key and link on fiphdr
;014j	16nov99	sort_out_tags
	;a incdup now starts at A not B and new seqno_it
	;b 28apr00 added levels/end/standAlone in sort_out_tags
	;c/d 27apr01 added preserve-multiple-eolns
	;e/f 31may01 added CDATA and PI-processInds and new-file-on-split
	;g/h 25aug01 bugs ! - continuation text and keepattribute and added locale
	;j 03oct01 added ignore-non-xml-data and redid splitters

(copyright) 2011 and previous years progs.FingerPost Ltd.
Topic revision: r1 - 21 Jan 2005 - 13:22:50 - TWikiGuest
 