3 Jul 2011

Read Docx in Java

How do you read a .docx document in Java?
Here is source code showing how to read the contents of a .docx document in Java. Reading a .docx file in Java does not require any special library: the technique is to extract the document from the archive using ZipEntry. The extracted XML is kept in memory temporarily, and finally that XML is parsed to pull out the text.
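The approach works because a .docx file is an ordinary ZIP archive whose main text lives in the entry word/document.xml. As a minimal sketch (the class name ListDocxEntries and the helper entryNames are made up for illustration, and "my.docx" is just the placeholder file name used in this post), the entries of the archive can be listed like this:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class, not part of the original post.
public class ListDocxEntries {
    /** Returns the names of all entries in a ZIP stream, in archive order. */
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // "my.docx" is a placeholder; it is resolved against the working directory.
        try (InputStream in = Files.newInputStream(Paths.get("my.docx"))) {
            for (String name : entryNames(in)) {
                System.out.println(name);
            }
        }
    }
}
```

For a document saved by Word, the printed list should include word/document.xml, which is the entry the program in this post looks for.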



Here is the source code for reading a .docx file using Java:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class JavaReadDocx {
    public static void main(String[] args) throws IOException, ParserConfigurationException, SAXException {
        // Collect the raw bytes of word/document.xml; decoding is left to the XML parser,
        // so multi-byte characters split across two reads are not corrupted.
        ByteArrayOutputStream xml = new ByteArrayOutputStream();

        // try-with-resources closes the ZIP stream and the underlying file.
        try (ZipInputStream docXFile = new ZipInputStream(new FileInputStream("my.docx"))) {
            ZipEntry zipEntry;
            while ((zipEntry = docXFile.getNextEntry()) != null) {
                if (zipEntry.getName().equals("word/document.xml")) {
                    byte[] buffer = new byte[1024 * 4];
                    int n;
                    // read() returns -1 at the end of the current entry, so the entry
                    // size (which can be -1, unknown, when streaming) is not needed.
                    while ((n = docXFile.read(buffer)) != -1) {
                        xml.write(buffer, 0, n);
                    }
                    break;
                }
            }
        }
        if (xml.size() == 0) {
            throw new IOException("word/document.xml not found in my.docx");
        }

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = factory.newDocumentBuilder();
        Document document = parser.parse(new ByteArrayInputStream(xml.toByteArray()));

        // Each <w:t> element holds one run of text.
        NodeList sections = document.getElementsByTagName("w:t");
        StringBuilder isidocx = new StringBuilder();
        for (int i = 0; i < sections.getLength(); i++) {
            // getTextContent() returns "" for an empty element instead of failing on a null child.
            isidocx.append(sections.item(i).getTextContent());
        }
        System.out.println(isidocx);
    }
}
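The program above concatenates every w:t run into one string, so paragraph boundaries are lost. A possible variation, sketched here under a few assumptions, groups the runs by their enclosing w:p element and matches elements by namespace URI rather than by the literal "w:" prefix. The class name DocxParagraphs and the method paragraphs are hypothetical, and the namespace constant assumes a "transitional" document as Word normally saves it; nested content such as text boxes is not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original post.
public class DocxParagraphs {
    // Namespace URI usually bound to the "w" prefix (assumes a transitional .docx).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of each <w:p> paragraph found in a document.xml stream. */
    static List<String> paragraphs(InputStream documentXml)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // match on namespace URI, not on the prefix text
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        List<String> result = new ArrayList<>();
        NodeList paras = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            Element para = (Element) paras.item(i);
            NodeList runs = para.getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            result.add(text.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // A tiny hand-written stand-in for word/document.xml.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String p : paragraphs(in)) {
            System.out.println(p);
        }
    }
}
```

To use it on a real file, pass in the bytes of word/document.xml read from the archive instead of the hand-written string.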
Hope this helps.

2 comments:

  1. Hi, this is Iwan. When I tried it, I got the output shown below. Could you explain where the my.docx file should be saved? Thank you.
    Please also send your reply to my email:
    iwan_kancil81@yahoo.com

    Exception in thread "main" java.io.FileNotFoundException: my.docx (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileInputStream.<init>(FileInputStream.java:66)
    at JavaReadDocx.main(JavaReadDocx.java:17)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)

  2. That exception is thrown because the .docx file could not be found.
    Place the .docx file directly in the project folder, not inside any subfolder. For example, if you created a project named ReadDocx, put my.docx inside that ReadDocx folder. It does not need to go into src or any other folder. Thank you.

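The FileNotFoundException discussed in the comments comes down to how a relative name such as "my.docx" is resolved: Java resolves it against the JVM's working directory, which in an IDE is typically the project root, as the reply describes. A small sketch for checking this (the class name WhereIsMyDocx and the helper resolve are made up for illustration):

```java
import java.io.File;

// Hypothetical diagnostic class, not part of the original post.
public class WhereIsMyDocx {
    /** Resolves a file name against the JVM's current working directory. */
    static File resolve(String name) {
        return new File(name).getAbsoluteFile();
    }

    public static void main(String[] args) {
        File docx = resolve("my.docx");
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Looking for:       " + docx.getPath());
        System.out.println("Exists:            " + docx.isFile());
    }
}
```

If "Exists" prints false, either move my.docx into the directory shown or pass an absolute path to FileInputStream.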