How to read a Word document in PHP

For .docx files, use this function. It returns the text of the document, or false if the file cannot be opened.


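function read_docx($filename)
{
    if (!$filename || !file_exists($filename)) return false;

    // Note: the procedural zip_* functions are deprecated as of PHP 8.0.
    $zip = zip_open($filename);
    // zip_open() returns an integer error code on failure.
    if (!$zip || is_numeric($zip)) return false;

    $content = '';
    while ($zip_entry = zip_read($zip)) {

        // The document body lives in word/document.xml; skip every other entry
        // before opening it, so no opened entry is left unclosed.
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }
    zip_close($zip);

    // Put a space between table cells and a line break after each paragraph,
    // then strip the remaining XML tags.
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $stripped_content = strip_tags($content);

    return $stripped_content;
}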

“docx” is different from “doc”. Docx files are basically XML files in a ZIP container (as described by Wikipedia). Doc files are binary blobs.
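To see the container structure for yourself, you can list the entries of a docx file with ZipArchive. This is only a minimal sketch: the function name `list_docx_entries` is made up for illustration, and it returns an empty array for anything that is not a readable ZIP file (which is what you would get for a binary .doc).

```php
<?php
// Hypothetical helper (not part of the original answer): returns the names of
// all entries inside a .docx container, or an empty array if the file cannot
// be opened as a ZIP archive.
function list_docx_entries(string $filename): array
{
    $zip = new ZipArchive();

    // ZipArchive::open() returns true on success and an integer error code otherwise.
    if ($zip->open($filename) !== true) {
        return [];
    }

    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();

    return $names;
}
```

For a typical docx the list should include word/document.xml, which holds the main body text.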

I am aware of no library that can easily read docx files in PHP (although Phpdocx can write them). However, since these are just ZIP files and XML files, you should be able to put something together using ZipArchive to open the docx container and DOMDocument, SimpleXML, XMLReader, or XSLTProcessor to read the XML documents themselves.
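The ZipArchive plus DOMDocument combination could look roughly like this. Treat it as a sketch under a few assumptions rather than a finished reader: the function name `docx_to_text` is invented for this example, it only reads word/document.xml (no headers, footers, or footnotes), and it assumes the usual transitional WordprocessingML namespace, so a file saved in strict Open XML mode would come back empty.

```php
<?php
// Hypothetical helper (illustration only): extracts plain text from a .docx
// file, one line per paragraph. Returns false if the file cannot be read.
function docx_to_text(string $filename)
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }

    // getFromName() returns false when the entry does not exist.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    $dom = new DOMDocument();
    // Suppress parser warnings and forbid network access while loading.
    if (!$dom->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return false;
    }

    // Assumption: the transitional WordprocessingML namespace.
    $wNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $lines = [];
    // <w:p> is a paragraph; the visible text sits in its <w:t> descendants.
    foreach ($dom->getElementsByTagNameNS($wNs, 'p') as $paragraph) {
        $text = '';
        foreach ($paragraph->getElementsByTagNameNS($wNs, 't') as $textNode) {
            $text .= $textNode->textContent;
        }
        $lines[] = $text;
    }

    return implode("\n", $lines);
}
```

Compared with stripping tags from the raw XML, walking the DOM decodes entities such as `&amp;` and keeps paragraph boundaries without depending on the exact order of closing tags.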
