4 Extending Modules

4.1 Chapter Overview

The fourth chapter describes extending modules. In contrast to the core features, each extending module affects only a minor part of the application, not the entire system. In the context of a content management system, these extending features are usually implemented as content modules. A particular content module is not an essential part of the system; instead, it enhances its functionality. Urchin CMS already includes a set of basic content modules, e.g. the Content module for publishing simple texts.

The first part of this chapter briefly describes several content modules that come with the default installation of the system. Content modules are used to manage content, publish it on the website, and add it to pages. The second part discusses two important concepts: forms on the front-end and search. Forms are used to collect data from users, and full-text search is used to find keywords in the website content. Both topics include a general overview of various approaches and a detailed description of the solution used in Urchin CMS, including the corresponding content modules.

4.2 Content Modules

The concept of the component axis and modules has already been discussed in the previous chapter. Content modules are used to manage website content of various types. This section briefly describes several modules that come with the Urchin installation, their features, and their typical use. In addition to their default capabilities, all modules can be adapted to specific requirements.

Table 4.1 gives a simple overview of content modules, including the number of elements per component and a short description. The number of elements a module uses depends on the content module type. In the following text, all content modules are sorted alphabetically. The Search, QuickContact, and Forms modules are described in two separate sections along with related concepts and a general overview.

Table 4.1: Overview of basic content modules

module	elements	each element represents
Articles	many	single article with perex and content
Content	one	piece of structured text
Enquiries	one	poll with question and answers
Events	many	event with dates and description
Forms	one	dynamic form with custom fields
Galleries	one	gallery with multiple images
News	many	single news item
QuickContact	one	pre-defined contact form
RSS	many	single data feed
Search	zero	n/a
Sitemap	zero	n/a

4.2.1 Articles

The Articles module allows adding multiple articles per component. In contrast to news, articles require both a perex and the article content to be filled in. Other fields, e.g. the date or preview images, are optional. Both the list of articles and the article detail are always available.

4.2.2 Content

Content is the simplest but most fundamental content module. Each component of this module contains only a single element with formatted text. The text is edited in a WYSIWYG editor and stored as HTML. The Content module is widely used for text pages with formatting and images, as well as for minor text blocks, e.g. a short note or a simple banner.

4.2.3 Enquiries

The Enquiries module provides a basic tool for interacting with visitors. A component of this module has only one element, the enquiry itself. An enquiry consists of one question and two or more answers. The number of votes is tracked for each answer. Voting attempts are logged, and the module includes a simple cookie- and IP-based mechanism to prevent duplicate or fake votes.

4.2.4 Events

The Events module is used for managing and displaying events. Events are activities that take place at a given time and venue, e.g. a conference, a football match, or an exhibition. This module allows setting a date (or interval), venue, category, and description for each event. The start date and short description are required fields; the venue, category, and long description are optional. On the front-end, the Events module allows filtering and searching events by all fields. Adjustments of this module are expected in individual projects because the default version cannot fit all possible requirements in this area.

4.2.5 Forms

The Forms module allows the user to create dynamic forms for the website intuitively. These forms can contain various types of fields. All form fields and settings are configurable by the user in the administration. This module is described further in the section on the form library.

4.2.6 Galleries

The Galleries module works as a container for publishing photo galleries on the web. A component of this module contains only one element, the gallery itself. Each gallery consists of multiple images, each with a thumbnail, a large picture, and a short description. The module is available both as a standard and as a linkable module, so a gallery can work as a stand-alone component or be attached to another element, e.g. an article.

4.2.7 News

News is a commonly used module for publishing short news items on the website. This module is similar to Articles and allows multiple news items per component. The required fields are the perex and the date of publication; optional fields include a long text and a preview image. The detail of a news item is available only if the long text is filled in.

4.2.8 QuickContact

The QuickContact module provides a basic contact form with three mandatory fields: subject, e-mail, and message. Unlike dynamic forms created with the Forms module, the quick contact form is static and immutable. This module is also described in the section on the form library.

4.2.9 RSS

RSS is an XML-based technology for publishing frequently changing content in a standardized format [42]. This module conforms to the RSS 2.0 specification and provides an option to create feeds from the website content. Each element of this module represents a single feed; each component therefore contains a list of one or more feeds. The data source is chosen from active component-on-page pairs in which the component is derived from a feedable module. Feedable modules are currently News and Articles, i.e. those having textual content and multiple elements per component.
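
As an illustration of what a feed built from news elements could look like, the following minimal sketch assembles an RSS 2.0 document with Python's standard library. It is not the Urchin implementation; the function name build_rss and the item keys (title, link, perex, published) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Build an RSS 2.0 document from a list of item dicts (hypothetical keys)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "description").text = item["perex"]
        # RSS 2.0 expects RFC 822 style dates
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")


news = [
    {
        "title": "New release",
        "link": "http://example.com/news/1",
        "perex": "A short summary of the news item.",
        "published": datetime(2012, 5, 1, 12, 0, tzinfo=timezone.utc),
    }
]
feed_xml = build_rss("Example news", "http://example.com/", "Latest news", news)
```

Each feed element of the RSS module would correspond to one such document, with the items taken from the chosen component-on-page pair.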

4.2.10 Search

The Search module allows the user to search the content of the current presentation. Details of this module are described in a separate section that concentrates on search methods and their implementation in Urchin CMS.

4.2.11 Sitemap

Sitemap is a simple module that renders the hierarchy of pages of the current presentation. The tree of pages contains all levels of the hierarchy and includes links to all pages. This module provides a common feature that helps visitors navigate the website.
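
Rendering a page hierarchy is a simple recursive traversal. The sketch below assumes that each page is a record with hypothetical id, parent_id, title, and url fields; the actual page structure in Urchin CMS may differ.

```python
def render_sitemap(pages, parent_id=None, depth=0):
    """Return indented text lines for all pages below parent_id."""
    lines = []
    for page in pages:
        if page["parent_id"] == parent_id:
            lines.append("  " * depth + f'{page["title"]} ({page["url"]})')
            # Recurse into the children of the current page
            lines.extend(render_sitemap(pages, page["id"], depth + 1))
    return lines


pages = [
    {"id": 1, "parent_id": None, "title": "Home", "url": "/"},
    {"id": 2, "parent_id": 1, "title": "Products", "url": "/products"},
    {"id": 3, "parent_id": 2, "title": "Widgets", "url": "/products/widgets"},
    {"id": 4, "parent_id": 1, "title": "Contact", "url": "/contact"},
]
sitemap_lines = render_sitemap(pages)
```

A real module would emit nested HTML lists with links instead of indented text, but the traversal stays the same.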

4.3 Dynamic Forms

Forms enable a visitor of the website to enter data and send it to the web application. They are commonly used for basic interaction with the user. Web forms resemble their paper predecessors and include input elements such as text fields, radio buttons, or check boxes. Forms are defined in HTML but require a server-side program to process the data. On websites, forms are most often used for search, registration, ordering products, sending comments, or contacting the website owner. Figure 4.1 shows an example of a simple contact form.

Figure 4.1: Simple contact form with three mandatory fields: subject, e-mail, and message

Web forms are declared inside the HTML form tag, which defines the method used to submit the form and encloses the form fields. The method is either POST or GET, as defined in the HTTP standard [19]. Form fields provide many common graphical user interface elements: input, textarea, password, file, select, radio, checkbox, submit, and reset. However, tree views and combo boxes are not supported. Labels serve as titles connected to the fields. The following text discusses different approaches to form processing.

4.3.1 Form Implementation

In many web applications, forms are implemented from scratch. The implementation of a form consists of several steps: defining the form, validating user input, handling errors, and processing data. Defining a form includes coding the form and field tags, managing formatting, attaching labels, and setting up default values. Form validation requires defining mandatory fields, validation rules, and error messages. Handling errors includes redirecting back to the form, displaying error messages, and pre-filling the form with the submitted values. Submitted data are processed and sent to an e-mail address or saved into database tables.

As described in the preceding text, many operations are necessary to create even a simple form. Manual form processing is time-consuming, non-trivial, and error-prone. In advanced web applications, some or all of these steps are automated to speed up development and prevent errors. Urchin CMS includes an integrated library that serves exactly this purpose.

4.3.2 The Form Library

The form library is a tool for automated form building and processing. Its purpose is to simplify form definition, validation, and processing. The library does not cover additional operations, such as saving data or sending e-mails; these must be implemented individually. The library is usable both on the front-end and in the administration, although it is primarily intended for the front-end. In the administration, crud-based generated forms are more common. Form fields are defined similarly to crud instances, using controls and validators.

4.3.3 Controls

Controls are objects that encapsulate common HTML form fields, such as inputs, radio buttons, and check boxes. In addition to rendering these elementary fields, controls provide many convenience features: setting and validating the data format, handling default values, attaching a label, and rendering the control as a part of the form. The input format is determined by the type of the control. Validators are attached to controls and used to check the input. All controls keep the submitted values if the form submission did not succeed; otherwise they display the default values given in the form definition.

The form library comes with many different types of controls. Basic text controls include a simple input and a textarea; others are used for date, time, and their variants. Advanced controls are selection and multi-selection boxes, radio buttons, and check boxes. Submit controls allow sending the form, the hidden control enables adding additional parameters, and the text control is used to display custom text inside the form. Security controls are used to prevent duplicate submission, spam, or CSRF attacks; they are discussed in a separate section. A control group object creates a group of controls that is displayed as a fieldset. Controls for uploading files are planned for the future.

4.3.4 Validators

Validators are used to check the user's input in the form. There are many types of validators; they are attached to controls, and each control can have any number of validators assigned. Common validators check whether required fields are filled in, check the input length, or match user-provided values against a pattern. Regular expression patterns are often used to check the date, integer, e-mail, or URL format. Special validators are attached to security controls to help protect the form.
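
The relationship between controls and validators can be sketched as follows. The class names (InputControl, NotEmpty, MaxLength, Pattern), the error messages, and the simplified e-mail pattern are hypothetical and only mirror the concepts described above; they are not the API of the Urchin form library.

```python
import re


class NotEmpty:
    message = "This field is required."

    def is_valid(self, value):
        return value.strip() != ""


class MaxLength:
    def __init__(self, limit):
        self.limit = limit
        self.message = f"At most {limit} characters are allowed."

    def is_valid(self, value):
        return len(value) <= self.limit


class Pattern:
    def __init__(self, regex, message):
        self.regex = re.compile(regex)
        self.message = message

    def is_valid(self, value):
        # Empty values pass here; requiring a value is the job of NotEmpty
        return value == "" or self.regex.fullmatch(value) is not None


class InputControl:
    def __init__(self, name, label, default="", validators=()):
        self.name = name
        self.label = label
        self.value = default
        self.validators = list(validators)
        self.errors = []

    def validate(self, submitted):
        # Keep the submitted value so the form can be displayed again
        self.value = submitted.get(self.name, "")
        self.errors = [v.message for v in self.validators
                       if not v.is_valid(self.value)]
        return not self.errors


# Deliberately simplified e-mail pattern, for illustration only
EMAIL = r"[^@\s]+@[^@\s]+\.[^@\s]+"
mail = InputControl(
    "mail", "E-mail",
    validators=[NotEmpty(), Pattern(EMAIL, "Enter a valid e-mail address.")],
)
mail_ok = mail.validate({"mail": "not-an-address"})
```

After the call, mail.errors holds the messages of all failed validators and mail.value still contains the rejected input, which corresponds to the behaviour of keeping values after an unsuccessful submission.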

4.3.5 Form Processing

The form library is responsible for the complete form processing workflow. A variant of this workflow is illustrated in figure 4.2. This variant is used for common one-step forms, such as those in content modules. Form settings and fields are defined in the controller and are available to all views and actions in the workflow for rendering the form and validating its values.

Figure 4.2: Form processing diagram with views and actions

Initially, the view with the form is displayed. After the form is submitted, the send action is triggered. In the send action, the form is validated using the defined rules. If the validation succeeds, the form is saved and the user is redirected to the feedback page. If not, the user is returned to the form view with visible error messages. A saving failure leads to the feedback page with a return link. Form values are kept after either a validation or a saving failure to allow review and resubmission.
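
The control flow of the send action described above can be summarized in a few lines. This is a sketch under stated assumptions: the names send_action, validate_contact, and SaveError are hypothetical, and the validate and save callbacks stand in for the form library and the module-specific saving code.

```python
class SaveError(Exception):
    """Raised by the hypothetical save callback when saving or mailing fails."""


def send_action(submitted, validate, save):
    """Return (view, context) for a submitted one-step form."""
    errors = validate(submitted)
    if errors:
        # Validation failed: back to the form, values and messages kept
        return "form", {"values": submitted, "errors": errors}
    try:
        save(submitted)
    except SaveError:
        # Saving failed: feedback page with a link back to the filled form
        return "feedback", {"success": False, "values": submitted,
                            "return_link": True}
    return "feedback", {"success": True}


def validate_contact(values):
    """Check the three mandatory fields of a simple contact form."""
    errors = {}
    for field in ("subject", "mail", "message"):
        if not values.get(field, "").strip():
            errors[field] = "This field is required."
    return errors


stored = []
view, context = send_action({"subject": "Hello"}, validate_contact, stored.append)
```

In this example the submission lacks the mail and message fields, so the form view is returned with two error messages and nothing is stored.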

4.3.6 Form Security

The form library provides four complementary methods to secure web forms. The first method uses a unique token that helps prevent CSRF attacks and duplicate submission of the form. It works exactly as previously described in the Cross-Site Request Forgery section. This token-based mechanism is always present in the form; it cannot be detached because of possible security risks. The remaining three methods are implemented using controls and validators to prevent spamming and submission of the form by robots. These methods are not mandatory, although they are strongly recommended.
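
A token of this kind is commonly implemented as a one-time random value stored in the session and embedded in the form as a hidden field. The sketch below assumes this common scheme; the function names and the session key form_token are hypothetical, and the session is modelled as a plain dictionary.

```python
import hmac
import secrets


def issue_token(session):
    """Generate a one-time token, store it in the session, and return it
    so that it can be embedded in the form as a hidden field."""
    token = secrets.token_hex(16)
    session["form_token"] = token
    return token


def consume_token(session, submitted_token):
    """Accept the submission only if the token matches; the token is removed
    in any case, so a repeated submission of the same form is rejected."""
    expected = session.pop("form_token", None)
    if expected is None or submitted_token is None:
        return False
    # Constant-time comparison of the two values
    return hmac.compare_digest(expected.encode(), submitted_token.encode())


session = {}
token = issue_token(session)
first_submit = consume_token(session, token)
second_submit = consume_token(session, token)
```

The first submission is accepted, while the second one with the same token is refused. A request forged by a third-party site fails for the same reason, because it cannot know the token stored in the visitor's session.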

The first control is simply called antispam. It asks the user to fill in the sum of two randomly generated integers. The submitted number is then compared with the sum that has been saved in the session. The delay control tracks the time elapsed between displaying the form and submitting it. If the time is shorter than the defined interval, the message is considered spam. This protection relies on the fact that a human filling in the form with meaningful data cannot complete it in just a few seconds as a robot can. The delay interval must nevertheless be chosen carefully: too short an interval lets robots through, while too long an interval rejects fast legitimate users.
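
Both controls can be sketched with a dictionary standing in for the session. The function names, the session keys, the range of the generated integers, and the default interval of five seconds are assumptions made for this example only.

```python
import random


def create_antispam(session, rng=random):
    """Store the expected sum in the session and return the question text."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    session["antispam_sum"] = a + b
    return f"{a} + {b} = ?"


def check_antispam(session, answer):
    """Compare the submitted answer with the sum saved in the session."""
    expected = session.pop("antispam_sum", None)
    return expected is not None and answer.strip() == str(expected)


def mark_displayed(session, now):
    """Remember when the form was displayed (now is a time in seconds)."""
    session["form_displayed_at"] = now


def check_delay(session, now, min_seconds=5.0):
    """Accept the submission only if enough time has elapsed."""
    shown = session.get("form_displayed_at")
    return shown is not None and now - shown >= min_seconds


session = {}
question = create_antispam(session)
mark_displayed(session, now=100.0)
too_fast = check_delay(session, now=102.0)
slow_enough = check_delay(session, now=110.0)
```

Passing the current time as an argument keeps the delay check easy to test; in an application it would typically come from time.time().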

The last type of protection is called a honey pot. The honey pot control works as a logical protection. The basic idea is to extend the form with auxiliary fields that use common names but do not logically belong to the form. These additional fields are hidden from human visitors. For example, suppose a form has three fields: name, surname, and address. An auxiliary field could be e.g. an e-mail address or a phone number. After the form has been submitted, the honey pot control checks whether this field is empty. Spamming robots cannot distinguish these extra fields and fill them in anyway. However, this method has no effect against a human spammer.
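
The honey pot check itself is a single condition. In the sketch below, the field names email and phone follow the example above, and the function name is hypothetical.

```python
HONEYPOT_FIELDS = ("email", "phone")  # hidden from human visitors


def is_probably_robot(submitted, honeypot_fields=HONEYPOT_FIELDS):
    """A submission is suspicious if any hidden auxiliary field is filled in."""
    return any(submitted.get(name, "").strip() for name in honeypot_fields)


human = {"name": "Jane", "surname": "Doe", "address": "Main St 1", "email": ""}
robot = {"name": "x", "surname": "y", "address": "z", "email": "spam@example.com"}
human_flagged = is_probably_robot(human)
robot_flagged = is_probably_robot(robot)
```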

4.3.7 QuickContact Revisited

The QuickContact module is a good example of using the form library. As already mentioned, this module contains a simple contact form with several input fields. The fields are defined statically, i.e. they cannot be changed. The form includes all security controls discussed in the security section and four standard controls. Table 4.2 lists all standard controls and their attached validators. Form messages are sent by e-mail and saved to the database.

Table 4.2: Standard controls and validators used in the QuickContact module

form field	control	validators	notes
subject	input	not empty
mail	input	not empty, regular expression	valid e-mail
type	selection	n/a	e.g. demand or inquiry
message	text area	not empty

4.3.8 Forms Revisited

The Forms module is used for creating custom forms on the front-end. Forms, their settings, and their fields are managed in the administration. This process is simple and intuitive and does not require any programming knowledge. The user can edit the form fields, the recipients, the text displayed after a successful or failed submission, and the mail content. Form fields are divided into logical groups, each with several fields. Each field has custom settings, such as options for radio buttons, a flag marking the field as mandatory, or the allowed length or range. Incoming messages are sent via e-mail to the defined recipients and logged in the database.

Form fields in the module are based on form library controls. Each field corresponds to a single control with one or more attached validators. Some validators are always present (format validation); the use of others depends on the field settings. Table 4.3 lists the controls available for this module with additional information. Grouping of fields using fieldsets is allowed, but only one level deep, without nesting. Uploading of files is currently not supported but is planned for the future. By default, each dynamic form includes all security controls.

Table 4.3: Form fields available in the Forms module

form field	control object	settings & validators
short text	input	required, length
integer	input	required, range, positive only
decimal	input	required, range, positive only, decimal places
e-mail	input	required, add to recipients
url	input	required
long text	text area	required, rows
date	date	required, interval
date & time	date time	required, interval
time	time	required, interval
list	select	required, default option
switch	radio	options, default option
yes/no	radio	default option
checkboxes	check box	options, default, checked 1+
multiple list	multi select	options, default, rows, selected 1+
displayed text	text	text content
parameter	hidden	parameter value
group of fields	control group	n/a

4.4 Content Search

The following text describes searching in the content of a website or a content management system. The first part of this section discusses two common approaches to search in small to medium-size web applications: internal and external search methods. Internal methods include entity-based search, content indexing, and a combination of both. External methods include search engine services and external search engines. The difference between the two types is that internal search is a part of the application, while external search is usually a third-party service or program connected via a public interface. This text discusses the typical use, advantages, and drawbacks of each approach.

The second part describes in detail the design of the search facility in Urchin CMS. The search sub-system of the application is composed of the Index and Search modules; it uses the content indexing approach to search the website content and the entity-based method to search pages. Both parts focus on searching website content stored in a database; searching in independent files or multimedia is not covered in this text. There are also other general search methods not discussed in this work, including inverted indexes and NoSQL databases.

4.4.1 Entity-Based Search

Entity-based search is the first of the three internal search methods explained here. The idea of this principle is to provide a search option for each entity independently of other entities. The searched entity is typically a single database table or a module with multiple tables. The search is performed directly in the content tables without the need for content indexing. Figure 4.3 displays a simple diagram that illustrates the entity-based search. The diagram includes the querying part with three sample tables. In contrast to the content indexing method, the indexing part is not present at all.

Figure 4.3: Schema of the entity-based search method

The search process for this approach iterates over all searchable tables and runs a customized query to find the keyword in each table. The records found are then displayed to the user. These records are usually sorted first by entity (e.g. articles first, products last) and only then by relevance or other criteria. This organization of results is very common for this approach.
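
The iteration over searchable tables can be sketched with an in-memory SQLite database. The tables article and product, their columns, and the function name are hypothetical; each entity has its own customized query, and the results stay grouped by entity in the order in which the entities are listed.

```python
import sqlite3

# One customized query per searchable entity (hypothetical tables and columns)
SEARCHABLE = [
    ("article", "SELECT id, title FROM article "
                "WHERE title LIKE :kw OR content LIKE :kw ORDER BY id"),
    ("product", "SELECT id, name FROM product "
                "WHERE name LIKE :kw OR description LIKE :kw ORDER BY id"),
]


def entity_search(conn, keyword):
    """Search every entity separately and return (entity, id, title) tuples."""
    # Note: LIKE wildcards inside the keyword are not escaped in this sketch
    pattern = f"%{keyword}%"
    results = []
    for entity, query in SEARCHABLE:
        for row_id, title in conn.execute(query, {"kw": pattern}):
            results.append((entity, row_id, title))
    return results


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, content TEXT);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
INSERT INTO article VALUES (1, 'Sea life', 'The urchin lives on the sea floor.');
INSERT INTO article VALUES (2, 'Gardening', 'Roses need a lot of sun.');
INSERT INTO product VALUES (1, 'Urchin plush toy', 'Soft and spiky.');
""")
results = entity_search(conn, "urchin")
```

Because each LIKE pattern starts with a wildcard, an ordinary index cannot be used and every table is scanned in full.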

The entity-based approach has both advantages and drawbacks. On the one hand, this method works well with applications that contain many diverse tables or an ad-hoc structure. It is also simple to implement and flexible enough to fit the requirements of individual modules. On the other hand, problematic areas are sorting results across entities, limited performance, and the presence of additional data not related to search. For example, MySQL versions before 5.6 do not support foreign keys and full-text indexes in the same table, because the InnoDB storage engine provides foreign keys without full-text indexes and MyISAM provides full-text indexes without foreign keys. The developer must therefore choose between these two often mission-critical options. If foreign keys are chosen, the search cannot benefit from full-text indexes and always performs a full table scan.

4.4.2 Content Indexing

Content indexing is a more advanced and complex way to implement search on a website. It is heavily inspired by data warehousing and business intelligence solutions. The main principle of this approach is to divide the search functionality into two parts, content indexing and content querying. Both parts share a database table that is not directly connected to the rest of the schema and stores the data required for search. Figure 4.4 displays a simple schema with both parts of the process and the indexing table. The indexing part is responsible for indexing content from the website into the shared table. The querying part provides the search itself, i.e. it searches for the keyword in the indexed content.

Figure 4.4: Schema of the content indexing search method

The indexing process runs at a given time using cron or a similar scheduling mechanism. The indexing interval depends on the purpose of the website, e.g. a news server requires a much shorter interval than a company presentation. This process also iterates over all searchable tables and indexes their content. Content indexing includes several steps: tracking changes, removing obsolete records, updating changed records, and parsing new content. The querying part works similarly to the entity-based method, except that the indexing table, rather than the content tables, is searched for the keyword.
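
The indexing steps listed above can be illustrated on plain dictionaries. In this sketch, both the index and the current content are keyed by a hypothetical (module, element id) pair, and changes are tracked by comparing modification timestamps; the function name reindex is an assumption as well.

```python
def reindex(index, elements):
    """Synchronize the index with the current elements and return statistics.

    Both arguments map a key to a dict with "text" and "modified" entries.
    """
    stats = {"removed": 0, "updated": 0, "added": 0, "skipped": 0}
    # Remove obsolete records whose source element no longer exists
    for key in list(index):
        if key not in elements:
            del index[key]
            stats["removed"] += 1
    for key, element in elements.items():
        entry = index.get(key)
        if entry is None:
            # Parse new content into the index
            index[key] = dict(element)
            stats["added"] += 1
        elif entry["modified"] < element["modified"]:
            # Update records that changed since the last run
            index[key] = dict(element)
            stats["updated"] += 1
        else:
            # Unchanged elements are not parsed again
            stats["skipped"] += 1
    return stats


index = {
    ("news", 1): {"text": "old text", "modified": 1},
    ("news", 2): {"text": "deleted item", "modified": 1},
    ("news", 4): {"text": "unchanged", "modified": 1},
}
elements = {
    ("news", 1): {"text": "new text", "modified": 2},
    ("news", 3): {"text": "fresh item", "modified": 1},
    ("news", 4): {"text": "unchanged", "modified": 1},
}
stats = reindex(index, elements)
```

A scheduled job would run such a synchronization once per indexing interval, which is the source of the delay between a content change and its appearance in search results.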

The content indexing approach also has some drawbacks and many advantages. The main disadvantage is its increased complexity in comparison with the previous method. The application must be well designed from the start to support this approach. A good example of such an architecture is Urchin CMS with its concept of the component axis, as will be discussed later. Another drawback is the delay between a content change and its indexing. The most significant advantages are related to the indexing table. The indexing table stores data in a format well suited to searching, uses full-text indexes for much better performance, and does not contain unrelated data. It also enables straightforward retrieval, sorting, and filtering of the records found.

4.4.3 Combined Search

The content indexing method can be combined with the entity-based approach for various reasons, e.g. if content indexing cannot be employed for all tables or to satisfy specific requirements. There are two options for combining the approaches. The first option simply uses both methods side by side. For example, content indexing is used for articles and news while the entity-based search works with pages. The second option is slightly different: it extends the content indexing method and uses multiple indexing tables for different purposes. The example in figure 4.5 uses two indexing tables, one for text content and the other for indexing e-shop categories and products.

Figure 4.5: Schema of the combined search method

4.4.4 Search Engine Service

The simplest and most straightforward approach to adding an external search option to a website is to employ a third-party search engine. Many companies that maintain search engines on the internet, such as Google [13] or Bing [5], also provide search solutions for individual websites. These include the free but limited Google Custom Search [14] and Bing Box [6] for minor projects, and the paid Google Site Search [15] for enterprise-level solutions. This method does not require any advanced programming knowledge. Together with its usability for static websites, this makes the approach a reasonable solution for minor public projects. Typical drawbacks are the indexing of data by a third party, limited capacity, and advertisements.

For example, implementing Google Custom Search involves three steps: setting up the engine, adding a search button, and creating a landing page. Setting up the search engine requires logging into the service and clicking the create engine button. The engine then offers basic settings and further customization. Settings include the sites to search, the important 'search only these sites' option, keywords, visual appearance, advertisements, and other options. Two fragments of HTML code are generated after the user has finished customizing the engine. The first fragment contains the search box; the second is used for displaying search results on the landing page.

4.4.5 External Search Engine

External search engines are applications that are also used for implementing advanced search on a website. In contrast to search engine services, these programs are set up and operated by the website developer, not a third party. External search engines index data similarly to the content indexing approach, although they are part of neither the application nor its database. Communication with the engine takes place via a public interface. The most notable open-source projects of this type are Apache Lucene [2], along with the Lucene-based Apache Solr [3], and Sphinx [43]. External search engines are highly efficient, suitable for high-load projects, and able to handle a wide range of text file types.

4.4.6 Index Module

The search facility in the Urchin application consists of two modules, the Index module and the Search module. The Index module directly applies the principle given in the general description of the content indexing approach. Figure 4.6 illustrates all important parts of the indexing process. The indexing process is triggered by the cron service once per defined interval, i.e. content is not indexed immediately after it changes. The administration also provides a simple interface to run the indexing manually. By default, this interface is available only to the admin user group.

Figure 4.6: Sequence diagram for the indexing process

First, all indexable modules are retrieved from the a_module table. After that, all active components are selected for each retrieved module. Active components are those assigned to pages displayed on the front-end. The last step is parsing searchable content from elements into the indexing table named a_search_data. Obsolete records are removed from the table, changed elements are updated, and new elements are added. To reduce the load of the process, the list of indexable modules is cached and unchanged elements are not parsed at all.

4.4.7 Search Module

The Search module is used for searching the website content within the selected presentation. This module is included in each installation by default and is composed of a search box and a result page. The search box is part of the layout and the result page is a system page; it is therefore always present and cannot be deleted. The search process is simple and combines the indexing approach with the entity-based method. The former is used to search the indexed content and uses the full-text index and its options. The latter is used for a simple regular expression-based search in page titles.

Figure 4.7 displays all important steps of the search process. First, the user sends a search request with the desired keyword(s). Then the a_page table is searched for titles matching the keyword and the a_search_data table is searched for text content containing the same keyword. Both types of records are merged and sent to the result page. The records displayed to the user are sorted by relevance and include keyword highlighting. All search terms are logged to support basic search statistics.

Figure 4.7: Sequence diagram for the search process
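
The steps of the search process can be sketched on in-memory data as follows. This is a simplification under stated assumptions: both sources are searched with a regular expression (whereas the module uses a full-text index for the indexed content), relevance is approximated by the number of keyword occurrences, and the record fields and function names are hypothetical.

```python
import re


def search(pages, indexed, keyword, log):
    """Search page titles and indexed content, merge and sort the hits."""
    log.append(keyword)  # logged for basic search statistics
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for page in pages:
        if pattern.search(page["title"]):
            hits.append({"type": "page", "title": page["title"], "score": 1})
    for record in indexed:
        count = len(pattern.findall(record["text"]))
        if count:
            hits.append({"type": "content", "title": record["element_title"],
                         "score": count})
    # Assumed relevance: more occurrences first
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits


def highlight(text, keyword):
    """Wrap every occurrence of the keyword in a strong tag."""
    return re.sub(re.escape(keyword),
                  lambda match: f"<strong>{match.group(0)}</strong>",
                  text, flags=re.IGNORECASE)


pages = [{"title": "Urchin features"}, {"title": "Contact"}]
indexed = [
    {"element_title": "Release notes",
     "text": "Urchin 2 is out. Urchin now indexes content."},
    {"element_title": "About", "text": "We build websites."},
]
search_log = []
hits = search(pages, indexed, "urchin", search_log)
excerpt = highlight("Urchin 2 is out.", "urchin")
```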

4.4.8 Indexing Table

As discussed before, the indexing table is a data structure used for storing website content in a form suitable for fast and efficient full-text search. The design of this table is based on the OLAP approach rather than the OLTP approach that is commonly used in transaction-oriented applications such as content management systems. For this reason, the indexing table is denormalized and not related to the rest of the application's database schema. It works similarly to a dimensional table in a data warehouse. Data in this table are extracted from other tables by the indexing service, and the table itself is used solely for querying.

The indexing table stores record information, the indexed text, two titles, and a timestamp. Record information includes the ids of the presentation, module, component, and element. These ids are used to determine the presentation, join pages to the records, and quickly look up auxiliary data if required. Information about pages is not stored in the table; pages are joined to the records after the search has been performed. This is a trade-off between performance and complexity of the indexing table. The indexed text is used for the full-text search, and the component and element titles are displayed in search results. The timestamp is used for detecting changes in elements to reduce the number of indexed elements.
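
A possible layout of such a table is sketched below using SQLite. Only the table name a_search_data comes from the text; all column names, the composite primary key, and the helper function store are assumptions. SQLite is used to keep the example self-contained, so the full-text index that a production database would place on the content column is omitted.

```python
import sqlite3

# Hypothetical column layout derived from the description above
SCHEMA = """
CREATE TABLE a_search_data (
    presentation_id INTEGER NOT NULL,
    module_id       INTEGER NOT NULL,
    component_id    INTEGER NOT NULL,
    element_id      INTEGER NOT NULL,
    component_title TEXT    NOT NULL,
    element_title   TEXT    NOT NULL,
    content         TEXT    NOT NULL,  -- indexed text used for searching
    indexed_at      INTEGER NOT NULL,  -- timestamp used for change detection
    PRIMARY KEY (presentation_id, module_id, component_id, element_id)
);
"""


def store(conn, row):
    """Insert a record or replace the existing one for the same element."""
    conn.execute(
        "INSERT OR REPLACE INTO a_search_data VALUES ("
        ":presentation_id, :module_id, :component_id, :element_id, "
        ":component_title, :element_title, :content, :indexed_at)",
        row,
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
row = {
    "presentation_id": 1, "module_id": 7, "component_id": 3, "element_id": 42,
    "component_title": "Company news", "element_title": "New release",
    "content": "Urchin 2 is out.", "indexed_at": 1000,
}
store(conn, row)
store(conn, {**row, "content": "Urchin 2.1 is out.", "indexed_at": 2000})
record_count = conn.execute("SELECT COUNT(*) FROM a_search_data").fetchone()[0]
```

Storing the same element twice leaves a single record with the newer text and timestamp, which matches the update step of the indexing process.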

[Pages 39-52]

Josef Kunhart

Diploma thesis, Ing.