Java XML API

Who uses XML in 2021? We all use JSON these days, don’t we? Well, it turns out XML is still being used. These code fragments could help get you up to speed when you’re new to the Java XML API.

Create an empty XML document

To start from scratch, you’ll need to create an empty document:


Document doc = DocumentBuilderFactory
                .newInstance()
                .newDocumentBuilder()
                .newDocument();
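
These fragments use the DOM API that ships with the JDK, so no extra dependency is needed. The relevant imports for the snippets in this section are:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;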

Create document from existing file

To load an XML file, use the following fragment.


File source = new File("/location/of.xml");
Document doc = DocumentBuilderFactory
                    .newInstance()
                    .newDocumentBuilder()
                    .parse(source);

Add a new element

Now that we have the document, it’s time to add some elements and attributes. Note that Document and Element are both Nodes.


// xmlElement, attributeName and attributeValue come from the surrounding code
final Element newNode = doc.createElement(xmlElement.getName());
newNode.setAttribute(attributeName, attributeValue);

// node is the parent Node (for example the document root) to attach the new element to
node.appendChild(newNode);

Get nodes matching XPath expression

XPath is a powerful way to search your XML document. Every XPath expression can match multiple nodes, or it can match none. Here’s how to use it in Java:


final String xPathExpression = "//node";
final XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodeList = (NodeList) xpath.evaluate(xPathExpression, 
        doc, 
        XPathConstants.NODESET);
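
NodeList doesn’t implement Iterable, so a plain index-based loop is the usual way to walk the matches (Node is org.w3c.dom.Node):

for (int i = 0; i < nodeList.getLength(); i++) {
	Node node = nodeList.item(i);
	System.out.println(node.getNodeName());
}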

JSON to XML

JSON is simpler and more widely used these days. However, it has fewer features than XML: namespaces and attributes are missing, for example. Usually these aren’t needed, but they can be useful. To convert JSON to XML, you could use the org.json:json dependency. This will create an XML structure similar to the input JSON.

<properties>
	<org-json.version>20201115</org-json.version>
</properties>

<dependencies>
	<dependency>
		<groupId>org.json</groupId>
		<artifactId>json</artifactId>
		<version>${org-json.version}</version>
	</dependency>
</dependencies>

JSONObject content = new JSONObject(json);
String xmlFragment = XML.toString(content);
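
As an illustration, with a made-up input (key order in the output is not guaranteed, because JSONObject does not preserve it):

String json = "{\"person\": {\"name\": \"Jane\", \"age\": 30}}";
String xmlFragment = XML.toString(new JSONObject(json));
// xmlFragment will look something like:
// <person><name>Jane</name><age>30</age></person>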

Writing XML

When we’re done manipulating the DOM, it’s time to write the XML to a file or to a String. The following fragments do the trick:


private void writeToConsole(final Document doc) 
    throws TransformerException{
	final StringWriter writer = new StringWriter();
	writeToTarget(doc, new StreamResult(writer));
	System.out.println(writer.toString());
}

private void writeToFile(final Document doc, File target) 
    throws TransformerException, IOException{
	try (final FileWriter fileWriter = new FileWriter(target)) {
		final StreamResult streamResult = new StreamResult(fileWriter);
		writeToTarget(doc, streamResult);
	}
}

private void writeToTarget(final Document doc, final StreamResult target) 
    throws TransformerException {
	final Transformer xformer = TransformerFactory.newInstance()
              .newTransformer();
	xformer.transform(new DOMSource(doc), target);
}
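
If you want indented output instead of everything on a single line, you can set output properties on the Transformer before calling transform. OutputKeys.INDENT is part of the standard API; the indent-amount property below is specific to the JDK’s built-in Xalan transformer:

xformer.setOutputProperty(OutputKeys.INDENT, "yes");
xformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");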

Leveraging Lucene

Imagine a catalog of a few hundred thousand items. These items have been labeled into a few hundred categories. Each item can be linked to up to three categories. New categories need to be added in order to make things easier to find. However, categories without content are useless. So some content needs to be linked to the new categories. Luckily both the items and the categories have a description, and that makes things easier.

The idea is simple:

  • Put all items into a searchable index.
  • For each category, find out what the most important words are.
  • Create a search term using these most important words, by just sticking them together.
  • Search the index of items for best matches.

And we’re done, sort of. It’s a bit more complicated than that, but that’s mostly because of fine-tuning.

Building the searchable index

This is where Apache Lucene comes in. Lucene is an open source full text indexing and search library, supported by the Apache Software Foundation.
First released in 1999, it is still in active development.

To create an index, you need to create an IndexWriter, and use it to add Documents to the index.

public IndexWriter createIndexWriter() throws IOException {
    Directory indexDir = FSDirectory.open(Paths.get(INDEX_DIR));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    return new IndexWriter(indexDir, iwc);
}

Only one writer is needed for adding items to the index. Note that IndexWriterConfig.OpenMode.CREATE will create a new index.
If there is anything already in the index, it will be removed. IndexWriterConfig.OpenMode.CREATE_OR_APPEND could be used if you want to add to an existing index.
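
The append variant is a one-line change to the configuration above:

iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);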

Next up is actually adding things to the index. Each item in the index is called a Document. Documents have Fields that can be used for searching.

Creating and adding a document can be done like this:

try {
	Document document = new Document();
	document.add(new StringField("url", "http://ghyze.nl", Field.Store.YES));
	document.add(new TextField("title", "My awesome blog",  Field.Store.YES));
	document.add(new TextField("description", "Blogging about things",  Field.Store.YES));

	writer.addDocument(document);
} catch (IOException e){
	e.printStackTrace();
}

When we’re done with adding the documents, we need to close the IndexWriter:

writer.close();
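
Since IndexWriter implements Closeable, try-with-resources is a safer alternative that also closes the writer when an exception occurs:

try (IndexWriter writer = createIndexWriter()) {
	// add documents here, as shown above
}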

Find important words

Let’s first define what an “important word” is. Important words are words that are the most relevant for each document in the collection.
There is a nice algorithm to determine the relevance of each word: tf-idf (short for term frequency–inverse document frequency).

The premise of this algorithm is that a word is more relevant for a document the more it appears in that document, and less relevant for a single document when it appears in more documents.

  • For each document, we count how many times each word appears and we divide that by the total number of words in this document. This is the term-frequency part.
  • For each word in every document, take the total number of documents and divide it by the number of documents containing this word. We don’t want this number to be too large, so we take the log of this. This is the inverse document frequency part.
  • Multiply these numbers to get the relative importance of each word for every document. A higher score means the word is more important; a small worked example follows below.
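
As a worked example (numbers made up): suppose a word appears 3 times in a 100-word document, and 10 out of 1,000 documents contain it. In Java terms:

double tf = 3.0 / 100.0;              // term frequency: 0.03
double idf = Math.log(1000.0 / 10.0); // inverse document frequency: ln(100) ≈ 4.61
double tfIdf = tf * idf;              // relative importance: ≈ 0.14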

The code for this algorithm consists of two classes and a value object.

The first class represents a single document in the collection. It is responsible for calculating the importance of the words it contains in relation to the words of all other documents.

import lombok.Getter;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Document {

    /**
     * The identifier for this document
     */
    @Getter
    private final String id;

    /**
     * The complete text for this document
     */
    @Getter
    private final Collection<String> lines;

    /**
     * Every word in this document, with the number of times it appears
     */
    private Map<String, Integer> wordsInDocument;

    /**
     * Constructor
     */
    public Document(String id, Collection<String> lines){
        this.id = id;
        this.lines = lines;
    }

    /**
     * Get the map with the unique words in this document, and the number of times they appear
     */
    public Map<String, Integer> getWordsInDocument(){
        if (wordsInDocument == null){
            calculateWordMap();
        }
        return wordsInDocument;
    }

    /**
     * Calculate the number of times each unique word appears in this document
     */
    private void calculateWordMap(){
        wordsInDocument = new HashMap<>();
        for (String line : lines){
            String[] words = line.split("\\s");
            for (String word : words){
                if (word.trim().length() > 1) {
                    Integer wordCount = wordsInDocument.getOrDefault(word, Integer.valueOf(0));
                    wordsInDocument.put(word, wordCount+1);
                }
            }
        }
    }

    /**
     * Calculate the importance of each word, compared to other words in this document
     * and all other documents in the index
     * @param index The collection of documents that also contains this document.
     * @return An ordered list indicating the importance of each word in this document.
     */
    public List<WordImportance> calculateWordImportance(WordIndex index){
        List<WordImportance> wordImportance = new ArrayList<>();
        double totalWordsInDocument = getWordsInDocument().values().stream().mapToInt(Integer::intValue).sum();
        double totalNumberOfDocuments = index.getNumberOfDocuments();
        for (String word : getWordsInDocument().keySet()){
            double tf = ((double) getWordsInDocument().get(word)) / totalWordsInDocument;
            double idf = Math.log(totalNumberOfDocuments / ((double) index.getNumberOfDocumentsContaining(word)));
            wordImportance.add(new WordImportance(word, tf*idf));
        }

        // most important word first
        wordImportance.sort(Comparator.comparing(WordImportance::getImportance).reversed());

        return wordImportance;
    }
}

The next class represents the collection of all documents. It is responsible for calculating for each word the number of documents that contain it.

import lombok.Getter;

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WordIndex {

    /**
     * The index, key is the identifier of the document
     */
    @Getter
    private Map<String, Document> index = new HashMap<>();

    /**
     * A map with all the words in all the documents, and the number of documents containing those words
     */
    private Map<String, Integer> documentCountForWords = null;

    /**
     * Constructor
     */
    public WordIndex(){
    }

    /**
     * Add a document to the index, overwriting when it already exists.
     * @param document the document to add
     */
    public void addDocument(Document document) {
        if (index.containsKey(document.getId())) {
            System.out.println("Overwriting document with ID "+ document.getId());
        }

        index.put(document.getId(), document);
    }

    /**
     * Get all words in all documents. If a word appears multiple times, it is only returned once.
     * @return All words in this document
     */
    public Collection<String> getAllWords(){
        Set<String> allWords = new HashSet<>();

        index.values()
                .forEach(e -> allWords.addAll(e.getWordsInDocument().keySet()));

        return allWords;
    }

    /**
     * Get the map with number of documents per word
     * @return the map with number of documents per word
     */
    public Map<String, Integer> getDocumentCountForWords(){
        if (documentCountForWords == null){
            calculateDocumentCountForAllWords();
        }
        return documentCountForWords;
    }

    /**
     * Iterate over every word in every document, and count the number of documents that word appears in.
     */
    private void calculateDocumentCountForAllWords(){
        Collection<String> allWords = getAllWords();
        documentCountForWords = new HashMap<>();
        for (String word : allWords){
            for (String documentId : index.keySet()){
                Map<String, Integer> document = index.get(documentId).getWordsInDocument();
                if (document.containsKey(word)){
                    Integer count = documentCountForWords.getOrDefault(word, 0);
                    documentCountForWords.put(word, count+1);
                }
            }
        }
    }

    /**
     * Get the total number of documents
     * @return the total number of documents
     */
    public int getNumberOfDocuments(){
        return index.size();
    }

    /**
     * Get the number of documents this word appears in.
     * @param word The word we're interested in
     * @return The number of documents containing this word
     */
    public int getNumberOfDocumentsContaining(String word){
        Map<String, Integer> wordCount = getDocumentCountForWords();
        return wordCount.getOrDefault(word, 0);
    }
}

Finally, there is a value object that is used by the document to indicate the relative importance of each word.

import lombok.AllArgsConstructor;
import lombok.Value;

@Value
@AllArgsConstructor
public class WordImportance {
    String word;
    double importance;
}
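
To tie the three classes together, a minimal usage sketch (with made-up document IDs and contents) could look like this:

WordIndex index = new WordIndex();
index.addDocument(new Document("doc-1", Arrays.asList("the quick brown fox")));
index.addDocument(new Document("doc-2", Arrays.asList("the lazy dog")));

// Most important words of doc-1, most important first.
// "the" appears in both documents, so its idf (and thus its importance) is 0.
List<WordImportance> ranked = index.getIndex().get("doc-1").calculateWordImportance(index);
System.out.println(ranked.get(0).getWord());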

The easy steps

Now it’s time to search for content for the new categories. To do this, we take the most important words of each new category and string them together, separated by spaces. Then we use that search term to search the index, and extract the results.

This is what the code for performing the search would look like:

/** 
 * Prepare the search engine
 */
public void initializeSearch(){
	try {
		indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(BuildStudyIndex.INDEX_DIR)));
		searcher = new IndexSearcher(indexReader);
		analyzer = new StandardAnalyzer();

		queryParser = new MultiFieldQueryParser(new String[]{"title", "shortDescription", "longDescription"}, analyzer);
	} catch (IOException e){
		e.printStackTrace();
	}
}

/**
 * Perform search
 */
public TopDocs search(Collection<String> words) throws IOException, ParseException{
	String searchTerm = String.join(" ", words);
	Query query = queryParser.parse(searchTerm);
	TopDocs results = searcher.search(query, 200000);
	return results;
}

The code to extract the search results would be something like this:

TopDocs result = search(searchTerms);
for (ScoreDoc hit : result.scoreDocs){
	Document found = searcher.doc(hit.doc);
	double score = hit.score;
}

Conclusion

There are a few steps I haven’t mentioned, but those are mainly plumbing and fine-tuning.

We have seen how to use Apache Lucene as a custom search engine. First we’ve built a searchable index, and later we have searched that index for relevant items.
We have also seen how to implement an algorithm that determines the most relevant words in a specific document, compared to the other documents in a collection.

The reason this works is that words have meaning. I know, stating the obvious. Each word gives meaning to the text, and this meaning has varying degrees of relevance to that text. The words that are most relevant to the text distinguish the meaning of the text from the other texts in the collection. We don’t need to know the actual meaning of the text, we just need to separate it from all other texts. Then, through the magic of search engines, we can match the texts that have the most similar meanings.

Spring Boot, MongoDB and raw JSON

Sometimes you want to store and retrieve raw JSON in MongoDB. With Spring Boot storing the JSON isn’t very hard, but retrieving can be a bit more challenging.

Setting up

To start using MongoDB from Spring Boot, you add a dependency on spring-boot-starter-data-mongodb:

	<dependency>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-data-mongodb</artifactId>
	</dependency>

And then you inject MongoTemplate into your class:

@Autowired
private MongoTemplate mongoTemplate;

Inserting into MongoDB

Inserting JSON is just a matter of converting the JSON into a Document, and inserting that document into the right collection:

String json = getJson();
Document doc = Document.parse(json);
mongoTemplate.insert(doc, "CollectionName");

Retrieving JSON

Retrieving JSON is a bit more complicated. First you need to get a cursor for the collection. This allows you to iterate over all the documents within that collection. Then you retrieve each document from the collection and cast it to a BasicDBObject. Once you have that, you can retrieve the raw JSON. Note that this fragment uses the legacy MongoDB driver API (DBCursor and BasicDBObject); with newer drivers, getCollection returns a MongoCollection<Document> instead.

DBCursor cursor = mongoTemplate.getCollection("CollectionName").find();
Iterator iterator = cursor.iterator();
while (iterator.hasNext()){
   BasicDBObject next = (BasicDBObject) iterator.next();
   String json = next.toJson();
   // do stuff with json
}

Transforming raw JSON to Object

With Jackson you can transform the retrieved JSON to an object. However, your object might miss a few fields, since MongoDB adds some to keep track of the stored documents. To get around this problem, you need to configure the ObjectMapper to ignore those extra fields.

ObjectMapper mapper = new ObjectMapper()
        .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
MyObject object = mapper.readValue(json, MyObject.class);
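
Alternatively, the same setting can be placed on the class itself, using Jackson’s @JsonIgnoreProperties annotation:

@JsonIgnoreProperties(ignoreUnknown = true)
public class MyObject {
	// fields...
}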

Java: Remove an element from a List

One of the more common tasks in programming is removing a specific element from a list. Although this seems straightforward in Java, it’s a bit trickier than it looks.

Before we start, we should build our list:

   public ArrayList<String> createList(){
      ArrayList<String> myList = new ArrayList<String>();
      
      myList.add("String 1");
      myList.add("String 2");
      myList.add("String 3");
      myList.add("String 4");
      myList.add("String 5");
      
      
      return myList;
   }

Let’s say we want to remove the String “String 2”. The first thing that comes to mind is to loop through the list, until you find the element “String 2”, and remove that element:

   public List<String> removeFromListUsingForEach(List<String> sourceList){
      for(String s : sourceList){
         if (s.equals("String 2")){
            sourceList.remove(s);
         }
      }
      return sourceList;
   }

Unfortunately, in this case, this will throw a

java.util.ConcurrentModificationException

I said “in this case”, because the exception is not always thrown. The details of this strange behavior are out of scope for this blog post, but can be found here.

There are several ways to remove an element from a list. Depending on your personal preference, and which version of Java you use, here are some examples.

1. Use a for-loop which loops backwards
You can use a for-loop which runs from the end of the list to the beginning. The reason you want to loop in this direction is that when you remove an element at some index, every element after it shifts one position toward the beginning of the list. Looping backwards, the shifted elements are only the ones you have already examined; if you ran the loop forward, you’d have to compensate for the shift, which just isn’t worth the effort.

   public List<String> removeFromListUsingReversedForLoop(List<String> sourceList){
      for(int i = sourceList.size()-1; i >= 0; i--){
         String s = sourceList.get(i);
         if (s.equals("String 2")){
            sourceList.remove(i);
         }
      }
      return sourceList;
   }

This works in every Java version since 1.2, although you can’t use generics until Java 1.5.

2. Use an Iterator
Another way to remove an element from a list is to use an Iterator. The Iterator will loop through the list and, if needed, can remove the current element from that list. This is done by calling Iterator.remove():

   public List<String> removeFromListUsingIterator(List<String> sourceList){
      Iterator<String> iter = sourceList.iterator();
      while (iter.hasNext()){
         if (iter.next().equals("String 2")){
            iter.remove();
         }
      }
      return sourceList;
   }

This works in every Java version since 1.2, although you can’t use generics until Java 1.5.

3. Use Java 8 Streams
What you’re essentially doing here is making a copy of the list, filtering out the unwanted elements.

   public List<String> removeFromListUsingStream(List<String> sourceList){
      List<String> targetList = sourceList.stream()
            .filter(s -> !s.equals("String 2"))
            .collect(Collectors.toList());
      return targetList;
   }

This works since Java 1.8. More about Java 8 can be found here.
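
4. Use removeIf
Java 8 also added Collection.removeIf, which removes all matching elements in place (it uses an Iterator internally, so it is safe):

   public List<String> removeFromListUsingRemoveIf(List<String> sourceList){
      sourceList.removeIf(s -> s.equals("String 2"));
      return sourceList;
   }

This works since Java 1.8.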

The Pomodoro Technique

One of my favorite time management methods is the Pomodoro Technique. The method is basically as follows:

  1. Pick a task.
  2. Work on it for 25 minutes. This is called a Pomodoro.
  3. Take a 5 minute break. Do something totally unrelated to your work.
  4. Work for another Pomodoro, or 25 minutes. This can be on the same task, or a new one if the previous one is finished.
  5. After 4 Pomodoros, take a longer break of 15 to 30 minutes.

Because of the breaks, your mind gets just enough rest to stay focused. Also, because you’re supposed to work on one task, and one task only during your Pomodoro, the quality of your work can go up.

Of course, the technique is completely customizable. Do you think 25 minutes is too short (or too long)? Try 45 minutes (or 10 minutes). Do you need a long break sooner? Take one every 3 Pomodoros.

Now I’ve made my own Pomodoro tracker. It’s written in Java 8, and it’s open source. You can find the source here. A runnable version can be found here.

This is still work in progress, and I mainly made it as a challenge to myself. If you like it, do whatever you want with it.

The pomodoro is running!

When the screen and the tray icon are red, a pomodoro is running. Right clicking the tray icon will bring up a menu.

Pomodoro break.

When the screen and the tray icon are green, you are on a break.

Waiting for the next Pomodoro to be started.

When the screen and the tray icon are blue, the program is waiting.

Changing the settings

The settings allow you to specify the location of the program, the times and the number of Pomodoros between long breaks.

Java 8

On 18 March 2014, Oracle launched Java 8. I’ve had a little time to play with it. Here are my first experiences.

Eclipse

Eclipse 4.4 (Luna) will get support for Java 8. However, Eclipse 4.3.2 (Kepler) can support Java 8 by installing a feature patch. This page shows how to install the patch.

Once installed, you’ll need to tell your projects to use Java 8. First add the JDK to Eclipse:

  • Go to Window -> Preferences
  • Go to Java -> Installed JREs
  • Add Standard VM, and point to the location of the JRE
  • Then go to Compiler
  • Set Compiler compliance level to 1.8

Then tell the project to use JDK 1.8:

  • Go to Project -> Properties
  • Go to Java Compiler
  • Enable project specific settings
  • Set Compiler compliance level to 1.8

Now you should be able to develop your applications using Java 8.

Maven

To enable Java 8 in Maven, two things need to be done:

  1. Maven must use JDK 1.8
  2. Your project must be Java 8 compliant.

To tell Maven to use JDK 1.8, point the JAVA_HOME variable to the correct location.
For the second point, to make the project Java 8 compliant, add the following snippet to your pom.xml file:

<build>
  <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
  </plugins>
</build>
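
With recent versions of the maven-compiler-plugin, the same can be expressed with properties instead of plugin configuration:

<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>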

Feature: streams

Streams can be used to iterate over collections, possibly in a parallel way. This has the advantage of making use of the multi-core architecture in modern computers. But more importantly, it makes the code shorter and more readable.

Case in point, consider the following code, for getting the minimum and maximum number in an array:

   public void minMax(int[] array){
      int min = array[0], max = array[0];
      for (int i : array) {
         if (i < min) {
            min = i;
         } else {
            if (i > max)
               max = i;
         }
      }
      System.out.println("Max is :" + max);
      System.out.println("Min is :" + min);
   }

Nothing too shocking. But with Java 8 this could be done shorter and easier:

   public void java8(int[] array){
      IntSummaryStatistics stats = 
            IntStream.of(array)
            .summaryStatistics();

      System.out.println("Max is :" + stats.getMax());
      System.out.println("Min is :" + stats.getMin());
   }

This method converts the array to an IntStream, and then collects the statistics of all numbers in that stream into an IntSummaryStatistics object. When testing this with an array of 10,000,000 items spanning a range of 1,000,000 numbers, the plain loop was still more than 5 times faster, though: the first method ran in 12 ms, the second in 69 ms.
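
Since parallelism was mentioned above: the stream version can be parallelized with a single extra call. For a cheap operation like this, the threading overhead may well eat the gains, so measure before assuming a speedup:

      IntSummaryStatistics stats = IntStream.of(array)
            .parallel()
            .summaryStatistics();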

Feature: lambda expressions

The biggest new feature of Java 8 is Lambda Expressions. These are sort of inline methods, and are mostly used in combination with streams. To explain this, let’s take a look at the following pieces of code. This will get all the files ending in “.csv” from a directory.

First, using a FilenameFilter:

      File sourceDir = new File("D:\\Tools");
      List<String> filteredList = Arrays.asList(sourceDir.list(new FilenameFilter(){

         @Override
         public boolean accept(File dir, String name)
         {
            return name.toLowerCase().endsWith(".csv");
         }
         
      }));

Now, using a Lambda:

      File sourceDir = new File("D:\\Tools");
      List<String> filteredList = Arrays.asList(sourceDir.list())
            .stream()
            .filter(s -> s.toLowerCase().endsWith(".csv"))
            .collect(Collectors.toList());

Notice the call to filter. It replaces the accept method of the FilenameFilter. What it effectively does is the following:

For each String in the stream:
 - assign the String to s
 - Call s.toLowerCase().endsWith(".csv"), this will return a boolean
 - If the result is true, the String is passed to the next method in the stream
 - If the result is false, the next String is evaluated
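
For completeness, here is a sketch of the same filter using the java.nio.file API, which is also available since Java 8 (Files.list returns a Stream<Path> that should be closed, hence the try-with-resources):

      // Files.list throws IOException
      try (Stream<Path> files = Files.list(Paths.get("D:\\Tools"))) {
         List<String> filteredList = files
               .map(p -> p.getFileName().toString())
               .filter(s -> s.toLowerCase().endsWith(".csv"))
               .collect(Collectors.toList());
      }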