Read and extract searched text from PDF file using iTextSharp in ASP.NET

Last Reply 11 months ago By Sneha123

Posted 11 months ago

I am working on text search and extraction from PDF files using the third-party library iTextSharp.

When I search, I do get the matching text, but the whole text of that page comes back along with it.

I thought of using phrases or chunks so that I get only the searched text together with the text just before and after it, instead of the whole page. Can anyone suggest code using phrases, or any other approach, that I can use for this? Thanks!

My code is:

string searchText = null;
string filename = System.AppDomain.CurrentDomain.BaseDirectory;
filename = @"C:\test.pdf";
searchText = textBox.Text.ToString();
List<int> pages = new List<int>();

if (File.Exists(filename))
{
    PdfReader pdfReader = new PdfReader(filename);
    List<Phrase> PhraseList = new List<Phrase>();
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        if (currentPageText.Contains(searchText))
        {
            pages.Add(page);
            textBox1.AppendText(PdfTextExtractor.GetTextFromPage(pdfReader, page));
            textBox1.Text += page.ToString();
        }
    }
    pdfReader.Close();
}

 

Posted 11 months ago

Hi Sneha123,

I will get back to you soon.


Posted 11 months ago

cool!!! Thanks


Posted 11 months ago

Hi Sneha123,

Refer to the code below.

// Requires: using System.Collections.Generic; using System.Linq;
string file = Server.MapPath("~/test.pdf");
if (System.IO.File.Exists(file))
{
    string searchText = textBox.Text.Trim();
    System.Text.StringBuilder pdfText = new System.Text.StringBuilder();
    iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(file);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        // Use a fresh strategy for every page, since it accumulates text as the page is parsed.
        iTextSharp.text.pdf.parser.ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
        // GetTextFromPage already returns a .NET string, so no encoding conversion is needed.
        string currentText = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        // Add a separator so the last word of one page does not merge with the first word of the next.
        pdfText.Append(currentText).Append(' ');
    }
    pdfReader.Close();
    // Split on any whitespace so that words separated by line breaks are found as well.
    List<string> lines = pdfText.ToString()
        .Split(new[] { ' ', '\t', '\r', '\n' }, System.StringSplitOptions.RemoveEmptyEntries)
        .ToList();
    List<string> matchedWord = new List<string>();
    foreach (string item in lines)
    {
        // Case-insensitive match against the search text.
        if (item.IndexOf(searchText, System.StringComparison.OrdinalIgnoreCase) >= 0)
        {
            matchedWord.Add(item);
        }
    }
}
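
The code above collects only the words that contain the search text. If what is needed is the text just before and after each match, as asked in the question, one possible approach is to cut a window of characters around every occurrence in the extracted page text. The following is only a minimal sketch using plain string handling: SnippetFinder, FindSnippets and contextChars are hypothetical names chosen for illustration and are not part of iTextSharp, and contextChars is assumed to be zero or greater.

```csharp
using System;
using System.Collections.Generic;

public static class SnippetFinder
{
    // Returns one snippet per occurrence of searchText in text. Each snippet is
    // the match itself plus up to contextChars characters on either side.
    // Matching is case-insensitive. contextChars is assumed to be >= 0.
    public static List<string> FindSnippets(string text, string searchText, int contextChars)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(searchText))
        {
            return snippets;
        }

        int index = text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            // Clamp the window to the bounds of the text.
            int start = Math.Max(0, index - contextChars);
            int end = Math.Min(text.Length, index + searchText.Length + contextChars);
            snippets.Add(text.Substring(start, end - start));

            // Continue searching after the current match.
            index = text.IndexOf(searchText, index + searchText.Length, StringComparison.OrdinalIgnoreCase);
        }
        return snippets;
    }
}
```

Inside the page loop it could be called as SnippetFinder.FindSnippets(currentText, searchText, 40), and each returned snippet appended to the text box together with the page number. Working on characters rather than on words keeps the sketch simple; a snippet may therefore start or end in the middle of a word.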

 


Posted 11 months ago

Thank you so much, Dharmendr. It helped!
