Iteratively search in a website (for dummies)

Luca D'Angelo on 3 May 2024
Answered: Luca D'Angelo on 9 May 2024
Hi all,
I have a list of thousands of chemical formulas (or candidate formulas). What I'd like to do is loop over the list (for i=1:size(FormulaList,1) ... end), enter each formula into the search bar of the website ( https://pubchem.ncbi.nlm.nih.gov/ ), and check whether I get any matches or a message like "0 results found".
I tried to apply the method described here ( https://it.mathworks.com/matlabcentral/answers/400522-retrieving-data-from-a-web-page ), but I could not work out how to get the "curl" command (sorry, I'm a complete beginner at this!).
Cheers,
Luca
[SL: removed the parenthesis from the end of one of the hyperlinks]

Accepted Answer

Luca D'Angelo on 9 May 2024
I've found a solution.
% MassList: column vector of molecular formulas
ResNum = zeros(size(MassList,1),1); % preallocate; 0 means "no match"
tic
for mass = 1:size(MassList,1)
    url = strcat('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/', MassList(mass,1), '/cids/JSON?list_return=cachekey');
    try
        jsonData = webread(url);
        ResNum(mass,1) = jsonData.IdentifierList.Size; % number of matching compounds
    catch
        ResNum(mass,1) = 0; % webread failed (no match, or any other error)
    end
    pause(0.205) % the website asks for max 5 requests / second
end
toc
The resulting column vector ResNum gives, for each molecular formula, the number of compounds found in PubChem.
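One limitation of the loop above is that every failure, including a dropped connection, is recorded as 0 matches. The following is a minimal sketch of a variant that keeps "no match" and "request failed" apart. The function names countFormulaMatches and fastFormulaUrl are hypothetical, and the code assumes that PubChem answers a formula with no hits with an HTTP 404 status and that the identifier of the webread error contains "404"; both should be checked against the PubChem and MATLAB documentation before relying on them.

```matlab
function counts = countFormulaMatches(formulas)
% Hypothetical helper: number of PubChem compounds per molecular formula.
% counts(k) is NaN when the request failed for a reason other than "not found".
formulas = string(formulas);
counts = nan(numel(formulas), 1);
opt = weboptions('Timeout', 30);
for k = 1:numel(formulas)
    try
        data = webread(fastFormulaUrl(formulas(k)), opt);
        counts(k) = data.IdentifierList.Size;
    catch err
        if contains(err.identifier, '404')
            counts(k) = 0; % assumption: HTTP 404 means "no compound found"
        else
            warning('countFormulaMatches:requestFailed', ...
                'Request for %s failed: %s', formulas(k), err.message);
        end
    end
    pause(0.205); % stay under 5 requests per second
end
end

function url = fastFormulaUrl(formula)
% Build the fastformula URL; urlencode guards against characters such as '+'.
base = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/';
url = [base, urlencode(char(formula)), '/cids/JSON?list_return=cachekey'];
end
```

Entries left as NaN can then be retried later instead of being mistaken for formulas that PubChem does not know.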

More Answers (1)

Steven Lord on 3 May 2024
Your best bet is probably to use one of the access methods that PubChem provides, as described on this page. Note the usage policy. If you have thousands of requests it's likely going to take minutes or longer, or the bulk data downloads functionality linked in the usage policy may be a better fit for your needs.
From the MATLAB side of things, the functions in this documentation category likely will be of use to you as may be the functions on this documentation page. [Before you ask no, I don't have any examples specific to using those functions to access that database.]
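To respect a request-rate limit such as the one in the usage policy, a loop can pause only for the time that the request itself did not already use. This is a minimal sketch; the helper name throttleDelay is hypothetical, and the limit of 5 requests per second is the figure quoted elsewhere in this thread, not something verified here.

```matlab
function waitSec = throttleDelay(elapsedSec, maxPerSecond)
% Hypothetical helper: seconds still to wait so that consecutive requests
% start at least 1/maxPerSecond seconds apart.
waitSec = max(0, 1/maxPerSecond - elapsedSec);
end
```

Inside a request loop this could be used as t = tic; data = webread(url); pause(throttleDelay(toc(t), 5)); so that slow requests do not add an extra fixed delay. At 5 requests per second, N formulas need at least N/5 seconds in total.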
  3 Comments
Steven Lord on 6 May 2024
You haven't shown us what values you're using for the maxAttempts and waitTime variables in your code.
Luca D'Angelo on 6 May 2024
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
maxAttempts = 10; % Maximum number of attempts
waitTime = 5; % Time to wait between attempts (in seconds)
attempt = 1;
while attempt <= maxAttempts
    jsonData = webread(apiUrl);
    if ~isfield(jsonData, 'Waiting') || isempty(jsonData.Waiting) %|| ~strcmpi(jsonData.Waiting, 'true')
        break; % Exit loop if request is not waiting anymore
    end
    attempt = attempt + 1;
    pause(waitTime);
end
% Check if the request is still processing after the loop
if isfield(jsonData, 'Waiting') && ~isempty(jsonData.Waiting) && strcmpi(jsonData.Waiting, 'true')
    disp('Your request is still processing. Please wait and try again later.');
    return;
end
if isfield(jsonData, 'Fault')
    disp(['Error: ', jsonData.Fault.Message]);
    return;
end
numResults = 0; % Initialize number of results
if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
    numResults = numel(jsonData.IdentifierList.CID); % Number of search results
end
disp(['Number of results for molecular formula "', molecularFormula, '": ', num2str(numResults)]);
It doesn't really matter, actually. Most of the previous code was written by ChatGPT, but it's useless. The main lines are:
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
jsonData = webread(apiUrl);
If webread worked, I might be able to find the information I am looking for. The problem, I think, is that the function launches the search but doesn't wait for the website to 'load' the result, so it returns 'Your request is still running'. Maybe I should find a way to launch the request, wait, and then check whether the web page has 'loaded' the results. What do you think?
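A minimal sketch of the "launch, wait, then check" idea follows. It assumes that the formula endpoint replies with a Waiting structure containing a ListKey, and that the result can then be fetched from compound/listkey/<ListKey>/cids/JSON; this is how PUG REST's asynchronous searches are understood to work, but it should be confirmed against the PubChem documentation. The function names formulaCids, isWaiting, and extractCids are hypothetical.

```matlab
function cids = formulaCids(formula, maxAttempts, waitTime)
% Hypothetical helper: start an asynchronous formula search, then poll for it.
base = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound';
opt = weboptions('Timeout', 30);
data = webread(sprintf('%s/formula/%s/cids/JSON', base, formula), opt);
for attempt = 1:maxAttempts
    if ~isWaiting(data)
        break
    end
    pause(waitTime);
    % Assumption: the ListKey from the Waiting reply identifies the pending job.
    key = string(data.Waiting.ListKey);
    data = webread(sprintf('%s/listkey/%s/cids/JSON', base, key), opt);
end
if isWaiting(data)
    error('formulaCids:timeout', 'Search still running after %d attempts.', maxAttempts);
end
cids = extractCids(data);
end

function tf = isWaiting(data)
% True while the decoded reply says the search has not finished.
tf = isstruct(data) && isfield(data, 'Waiting');
end

function cids = extractCids(data)
% Return the list of compound IDs, or [] if the reply holds none.
cids = [];
if isstruct(data) && isfield(data, 'IdentifierList') && isfield(data.IdentifierList, 'CID')
    cids = data.IdentifierList.CID;
end
end
```

A call such as cids = formulaCids('C9H8O4', 10, 5); would then give the number of matches as numel(cids). HTTP errors raised by webread are not caught in this sketch.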

Release

R2023a
