For .docx files, use this function. It returns the text of the document, or false if the file cannot be read.
function read_docx($filename)
{
    if (!$filename || !file_exists($filename)) return false;

    // Note: the procedural zip_* functions are deprecated as of PHP 8.0.
    $zip = zip_open($filename);
    if (!$zip || is_numeric($zip)) return false;

    $content = '';
    while ($zip_entry = zip_read($zip)) {
        // Only the main document part holds the body text.
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }
    zip_close($zip);

    // Replace these closing-tag sequences with a space or a line break
    // so the text does not run together once the tags are stripped.
    $content = str_replace('</w:r></w:p></w:tbl>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    return strip_tags($content);
}
“docx” is different from “doc”. Docx files are essentially XML files inside a zip container (as described by Wikipedia). Doc files are binary blobs.
I am not aware of any library that can easily read docx files in PHP (although Phpdocx can write them). However, since these are just zip files containing XML, you should be able to put something together using ZipArchive to open the docx container and DOMDocument, SimpleXML, XMLReader, or XSLTProcessor to read the XML documents themselves.
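As a rough illustration of that idea, here is a minimal sketch using ZipArchive and DOMDocument. The function name docx_to_text is made up for this example, and it assumes the body text lives in word/document.xml under the standard WordprocessingML namespace. It only collects the text of w:t elements, one line per w:p paragraph, so tabs, footnotes, headers, and text boxes are not handled properly.

```php
<?php
// Sketch only: docx_to_text is a hypothetical helper, not part of any library.
// Returns the document text as a string, or false on failure.
function docx_to_text($filename)
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($filename) !== true) {
        return false;
    }
    // getFromName() returns false if the entry does not exist.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return false;
    }

    // Main WordprocessingML namespace: paragraphs are w:p, text nodes are w:t.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $lines = [];
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $text = '';
        foreach ($paragraph->getElementsByTagNameNS($ns, 't') as $node) {
            $text .= $node->textContent;
        }
        $lines[] = $text;
    }
    return implode("\r\n", $lines);
}
```

Calling docx_to_text('report.docx') (the file name is just a placeholder) should then give you the paragraphs separated by line breaks. Working on the parsed XML rather than on string replacements means the result does not depend on the exact order of closing tags in the file.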